Philip McClarence
Reading PostgreSQL EXPLAIN and EXPLAIN ANALYZE Output

Every PostgreSQL performance conversation eventually lands on a question that sounds trivial: what does this EXPLAIN mean? The output is almost readable. There are node names in English, numbers that look familiar, and enough structure that you can guess at the intent. But if you're guessing, you're going to miss the signal that actually matters — and the difference between a plan that returns in 0.3 ms and one that returns in 400 ms is often one line of EXPLAIN output that looks like boilerplate.

This article is a systematic walk through how to read an EXPLAIN plan on PostgreSQL 17, using real output captured from a live database. By the end you should be able to look at a plan, identify what each node is doing and why, spot the three places where things usually go wrong, and articulate in one sentence why the query is slow — or whether it's actually fine and something else is wrong.

This is part of the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series.

EXPLAIN vs EXPLAIN ANALYZE vs EXPLAIN (ANALYZE, BUFFERS)

The three variants you'll use in practice:

EXPLAIN — asks the planner what it would do, without running the query. Fast (milliseconds), safe for expensive queries, but every number is an estimate. Useful for "how expensive does the planner think this is?" and "did my new index change the plan shape?"

EXPLAIN ANALYZE — actually runs the query and reports what happened. You get both the planner's estimates and the real measured results, side by side. Use this in development and staging; use it on production only after thinking about the cost. Three warnings: (1) EXPLAIN ANALYZE on an INSERT/UPDATE/DELETE will execute the DML — wrap it in a BEGIN; ... ROLLBACK; block if you don't want the side effects. (2) The query runs end-to-end, so a slow query is slow again, and any locks it takes are held for real. (3) ANALYZE pulls rows into the buffer cache and may evict other working-set pages; running it on a busy production system can perturb the performance of the exact thing you're measuring. On hot-path queries, prefer capturing a representative plan via auto_explain or an EXPLAIN visualiser in a monitoring tool rather than running EXPLAIN ANALYZE ad-hoc under load.
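The rollback pattern for the first warning looks like this (the table and predicate are illustrative, borrowed from the examples later in this article):

```sql
-- Run EXPLAIN ANALYZE on DML without keeping its side effects.
-- The UPDATE executes for real, so the timings and row counts are
-- accurate, but the transaction is rolled back afterwards.
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE sim_bp_orders
SET    status = 'cancelled'
WHERE  status = 'pending'
  AND  created_at < now() - interval '90 days';

ROLLBACK;  -- discard the updates; keep only the plan output
```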

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) — the version you should default to. BUFFERS adds per-node cache-hit/read/dirtied counts and must still be specified explicitly; EXPLAIN ANALYZE on its own does not include buffer statistics. VERBOSE adds the output column list at each node (useful for spotting why indexes aren't being chosen). SETTINGS reports any non-default planner knobs that might be influencing the plan.

You can also ask for structured output with FORMAT JSON, FORMAT YAML, or FORMAT XML. JSON preserves every field and is what you want for programmatic analysis; the text format is easier to read inline.
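For example, capturing a plan as JSON for a tool to consume (a sketch of the query analysed in the next section; the exact column list is illustrative):

```sql
-- FORMAT JSON preserves every per-node field, including ones the
-- text format abbreviates or omits.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT o.created_at, o.status, u.email
FROM   sim_bp_orders o
JOIN   sim_bp_users  u USING (user_id)
WHERE  o.status = 'pending'
ORDER  BY o.created_at DESC
LIMIT  20;
```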

The plan tree

Every EXPLAIN output is a tree. The root is the outermost node, which is whatever produces the query's final rows; children feed their output up to their parent. PostgreSQL indents children under their parent with arrows:

Parent Node
  ->  Child A
  ->  Child B
        ->  Grandchild

The top-down narrative is: "to produce Parent Node's output, PostgreSQL runs Child A and Child B, feeding both into the parent. Child B itself is produced by running Grandchild." Execution order is bottom-up (leaves run first), but the way to read the plan is top-down — start with "what is this query ultimately asking for?" and then follow the tree down to understand how PostgreSQL intends to answer.

Here's a real example — "show twenty recent pending orders with the user's email." The plan is against a 500,000-row sim_bp_orders table and 200,000-row sim_bp_users table on PostgreSQL 17.8:

Limit  (cost=0.85..16.38 rows=20 width=37) (actual time=0.075..0.277 rows=20 loops=1)
  Buffers: shared hit=211
  ->  Nested Loop  (cost=0.85..77853.84 rows=100300 width=37) (actual time=0.074..0.275 rows=20 loops=1)
        Inner Unique: true
        Buffers: shared hit=211
        ->  Index Scan Backward using idx_sim_bp_orders_created_at on sim_bp_orders o
              (cost=0.42..30949.29 rows=100300 width=20)
              (actual time=0.012..0.151 rows=20 loops=1)
              Filter: ((status)::text = 'pending'::text)
              Rows Removed by Filter: 106
              Buffers: shared hit=131
        ->  Memoize  (cost=0.43..0.55 rows=1 width=25) (actual time=0.006..0.006 rows=1 loops=20)
              Cache Key: o.user_id
              Cache Mode: logical
              Hits: 0  Misses: 20  Evictions: 0  Overflows: 0  Memory Usage: 3kB
              Buffers: shared hit=80
              ->  Index Scan using sim_bp_users_pkey on sim_bp_users u
                    (cost=0.42..0.54 rows=1 width=25)
                    (actual time=0.003..0.003 rows=1 loops=20)
                    Index Cond: (u.user_id = o.user_id)
                    Buffers: shared hit=80
 Planning Time: 1.183 ms
 Execution Time: 0.309 ms

Read top-down. The root is Limit, which caps the result at twenty rows. Below it is a Nested Loop that joins two sources: an Index Scan Backward over sim_bp_orders and a Memoize wrapping an Index Scan on sim_bp_users. The outer loop walks the orders index backwards (newest first) filtering for status = 'pending', and for each matching order, looks up the user via the primary-key index — but the Memoize caches results by user_id in case the same user appears multiple times (they don't in this particular run, so all 20 are cache misses).

This is a very good plan. 0.309 ms, 211 shared-buffer hits, no reads from disk. The LIMIT 20 short-circuits the nested loop early — only 106 rows are read and filtered before twenty matches are found. The same query with a much larger LIMIT would have very different numbers.

Now let's break down what each number means.

Per-node fields: cost, rows, width, time, loops

On every node, PostgreSQL prints something like:

Node Name  (cost=S..T rows=R width=W) (actual time=s..t rows=r loops=l)

The first parenthesis (cost=... rows=... width=...) is the planner's estimate. The second (actual time=... rows=... loops=...) is what actually happened when the query ran. EXPLAIN without ANALYZE only prints the first.

cost=startup..total. Dimensionless units, scaled relative to seq_page_cost (1.0 by default). The other cost GUCs — random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost — are all expressed in the same arbitrary unit, which lets the planner compare heterogeneous operations against each other. startup is the estimated cost to produce the first row from this node; total is the estimated cost to produce all rows. The difference matters: a Sort node has a high startup cost (it has to consume all input before it can produce the first row) but a low marginal cost per row after that. An Index Scan has a very low startup cost. When you see a node above a LIMIT, what matters is the startup cost of the child, because the limit stops asking for rows as soon as it has enough.

rows=N. The planner's estimate of how many rows this node will emit. Per loop — see below.

width=W. Estimated average row width in bytes. Mostly informational; you use it to sanity-check whether a Sort or Hash might spill to disk (row width × estimated rows ≈ memory requirement).

actual time=startup..total. Wall-clock milliseconds, measured per loop. startup is the time to produce the first row from this node; total is the time to produce the last row.

actual rows=r loops=l. rows is the number of rows produced per loop, averaged over all l loops. To get the total rows this node emitted, multiply: rows × loops.

Loops matter. In the nested loop example above, the Memoize node reports rows=1 loops=20 — meaning the node was executed 20 times (once per outer row), and each execution produced 1 row. Total output: 20 rows. But the actual time=0.006..0.006 is per loop, so the total time spent in Memoize was about 0.006 ms × 20 = 0.12 ms. Forgetting to multiply by loops is the single most common mistake in reading EXPLAIN output — a node that looks fast per loop can still dominate the query time if it runs 50,000 times.

The relationship between rows estimate and actual rows is arguably the most important signal in a plan. If the planner estimated 15 and the actual was 8,000, the plan was built on bad assumptions: every decision it made downstream (join strategy, memory allocation, whether to parallelise) was wrong. A ratio past 10× in either direction is worth treating as a warning; past 100× it's usually critical. The fix is almost always ANALYZE on the affected table, or extended statistics if the bad estimate comes from correlated columns that the planner assumes are independent.
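Those fixes in SQL form (the region column and the statistics object's name are hypothetical — substitute your own correlated columns):

```sql
-- Refresh per-column statistics for the table with the bad estimate.
ANALYZE sim_bp_orders;

-- If the misestimate comes from correlated columns the planner
-- assumes are independent (say, status and a hypothetical region
-- column), declare the dependency with extended statistics,
-- then re-analyze so the new statistics object gets populated.
CREATE STATISTICS sim_bp_orders_status_region (dependencies)
    ON status, region FROM sim_bp_orders;
ANALYZE sim_bp_orders;
```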

Node types you'll see most often

Scan nodes — where rows enter the plan.

  • Seq Scan — read every row of a table. Reports Filter: when there's a WHERE clause applied, and Rows Removed by Filter: telling you how many rows were read and discarded. Cheap on small tables, catastrophic on large ones with selective filters.
  • Index Scan — use an index to find rows, then fetch each matching row from the heap for any columns the index doesn't contain. Reports Index Cond: for conditions satisfied by the index, and optionally Filter: for conditions that have to be rechecked after the heap fetch.
  • Index Only Scan — use an index to find rows and return all requested columns directly from the index, skipping the heap entirely. Requires that every referenced column is available from the index — either as a key column or added via INCLUDE. Reports Heap Fetches: — this number should be close to zero; a non-zero count means the visibility map didn't cover some pages and PostgreSQL had to check the heap anyway, defeating the point.
  • Bitmap Index Scan + Bitmap Heap Scan — two-step pattern for combining multiple index conditions or for queries that match many rows. First, the index scan builds a bitmap of heap pages that might have matches. Then the heap scan visits those pages once each, avoiding re-reading pages that contain multiple matches. Reports Exact Heap Blocks and Lossy Heap Blocks — a high lossy-block count means work_mem was too small to track individual tuples, so PostgreSQL fell back to page-level tracking and has to re-filter the matches.

Join nodes — combining two inputs.

  • Nested Loop — for each row on the outer side, scan the inner side. Optimal when the outer side is small and the inner side has an index on the join column. Pathological when both sides are large.
  • Hash Join — build a hash table from the smaller side (the Hash child), then probe it with each row from the other side. Optimal for equi-joins on unordered data when the smaller side fits in work_mem. Reports Hash Batches: — if this is greater than 1, the hash table didn't fit in memory and had to spill.
  • Merge Join — two pre-sorted inputs, walked in parallel. Optimal when both sides are already sorted (or can be sorted cheaply via an index). Reports Merge Cond:.

Sort and aggregation nodes.

  • Sort — ordering rows. Reports Sort Key: (the columns being sorted), Sort Method: (algorithm), Sort Space Type: (Memory or Disk), and Sort Space Used: (in KB).
    • top-N heapsort — used under a LIMIT N. Keeps only N rows in a heap regardless of input size. Efficient in memory and time.
    • quicksort — everything fits in work_mem.
    • external merge — didn't fit; spilled to disk.
  • Aggregate / HashAggregate / GroupAggregate — SUM/AVG/COUNT/GROUP BY. HashAggregate builds a hash table keyed by the group-by columns; GroupAggregate requires presorted input. HashAggregate can spill to disk with Planned Partitions: N Batches: M.
  • Limit — cap the number of rows. Often the shortcut that makes a plan fast.
  • WindowAgg — window functions like ROW_NUMBER() and SUM() OVER.
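
You can watch the Sort Method change by shrinking work_mem in a session (session-local and safe to experiment with; the exact method and spill size depend on your data):

```sql
-- Force a spill: 64kB is the minimum work_mem.
SET work_mem = '64kB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sim_bp_orders ORDER BY created_at;
-- Expect: Sort Method: external merge  Disk: ...kB

-- Restore the session default; the same plan should now report
-- quicksort (or top-N heapsort if you add a LIMIT).
RESET work_mem;
```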

Parallelism.

  • Gather / Gather Merge — the leader process collecting results from parallel workers. Workers Planned: and Workers Launched: tell you how many workers the planner asked for vs actually got. When Launched < Planned, the system is short on parallel worker slots.
  • Parallel Seq Scan / Parallel Index Scan / Parallel Hash Join — parallel-aware variants of the base node types.

Utility nodes.

  • Materialize — cache an intermediate result so the parent can rescan it without redoing the work. Common above the inner side of a Nested Loop.
  • Memoize (new in PostgreSQL 14) — LRU cache above an inner loop. Reports Cache Key:, Hits:, Misses:, Evictions:, and Memory Usage:. A high hit ratio is good; a high miss ratio just means the cache didn't help this particular query but didn't hurt either.
  • CTE Scan — reading from a materialised CTE. In PostgreSQL 12+ most CTEs are inlined and this node disappears; you see it when a CTE is referenced multiple times or marked MATERIALIZED.
  • SubPlan — a correlated subquery, executed once per outer row. Almost always worth rewriting as a JOIN.
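A sketch of the SubPlan-to-JOIN rewrite, using this article's tables (both queries are illustrative):

```sql
-- Correlated scalar subquery: PostgreSQL does not decorrelate
-- this form, so the inner SELECT runs once per user and shows
-- up as a SubPlan node.
SELECT u.email,
       (SELECT count(*)
        FROM   sim_bp_orders o
        WHERE  o.user_id = u.user_id) AS order_count
FROM sim_bp_users u;

-- Equivalent JOIN form: one aggregate pass over orders, then a
-- single join -- usually a far better plan at scale.
SELECT u.email, coalesce(c.order_count, 0) AS order_count
FROM   sim_bp_users u
LEFT   JOIN (SELECT user_id, count(*) AS order_count
             FROM   sim_bp_orders
             GROUP  BY user_id) c USING (user_id);
```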

The Buffers line

With BUFFERS enabled, every node reports how many 8 KB pages it touched:

Buffers: shared hit=3689

The four counters to know:

  • shared hit=N — pages found in shared_buffers (PostgreSQL's cache). No I/O system calls.
  • shared read=N — pages the backend had to read into shared_buffers via a read() system call. Whether the OS page cache satisfied the read without touching disk is invisible to EXPLAIN — these show up as reads regardless.
  • shared dirtied=N — pages the query modified in cache. Common with DML; in a read-only SELECT, usually comes from hint-bit updates or cleanup.
  • shared written=N — pages written back out during this node's execution. Usually this is the backend itself being forced to evict dirty pages to make room for new ones, not the background writer — so a high written count means your query is doing someone else's work because the dirty-page pool was already full.

There's also local hit/read/dirtied/written for per-session temporary tables, and temp read=N written=N for work files produced by sorts and hash joins that spilled.

A query doing shared read=2016, temp written=2051 in a single node is telling you two things: the table isn't fitting in cache, and the query itself is generating its own on-disk temp files because some operation (hash, sort, bitmap) exceeded work_mem. Both are fixable; both hurt.

A harder plan: the HashAggregate spill

Here's a plan with more going on — "the twenty users with the most pending-or-shipped orders." Against the same 500,000-row orders table and 200,000-row users table:

Limit  (cost=42281.85..42282.10 rows=20 width=29) (actual time=408.141..408.145 rows=20 loops=1)
  Buffers: shared hit=3737 read=2016, temp read=1320 written=2051
  ->  Sort  (cost=42281.85..42722.81 rows=176383 width=29) (actual time=406.664..406.667 rows=20 loops=1)
        Sort Key: (count(*)) DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  HashAggregate  (cost=33757.54..37588.36 rows=176383 width=29)
              (actual time=347.354..392.138 rows=117060 loops=1)
              Group Key: u.email
              Planned Partitions: 4  Batches: 5  Memory Usage: 8241kB  Disk Usage: 6920kB
              ->  Hash Join  (cost=7932.00..21080.02 rows=176383 width=21)
                    (actual time=140.974..285.275 rows=175263 loops=1)
                    Hash Cond: (o.user_id = u.user_id)
                    ->  Seq Scan on sim_bp_orders o  (cost=0.00..9939.00 rows=176383 width=4)
                          (actual time=0.018..51.809 rows=175263 loops=1)
                          Filter: ((status)::text = ANY ('{pending,shipped}'::text[]))
                          Rows Removed by Filter: 324737
                    ->  Hash  (cost=4064.00..4064.00 rows=200000 width=25)
                          (actual time=140.864..140.865 rows=200000 loops=1)
                          Buckets: 131072  Batches: 2  Memory Usage: 6822kB
                          ->  Seq Scan on sim_bp_users u  (cost=0.00..4064.00 rows=200000 width=25)
                                (actual time=1.735..87.218 rows=200000 loops=1)
 Planning Time: 1.107 ms
 Execution Time: 408.215 ms

408 ms. Let's read it top-down and find where the time actually goes.

Root: Limit + Sort. The Sort is top-N heapsort, Memory: 26 kB — fine. Under the LIMIT 20, a top-N sort is almost free regardless of input size.

HashAggregate — the first red flag. The Group Key is u.email; the aggregate is a count(*) across the 175k joined rows. Two numbers jump out: Planned Partitions: 4 Batches: 5 and Memory Usage: 8241 kB Disk Usage: 6920 kB. PostgreSQL 13+ can spill a HashAggregate to disk when the hash table exceeds work_mem: the executor detects that not all groups will fit in memory, writes unfinished groups out to per-partition spill files, and processes them in a second pass. The exact number of spill-and-resume cycles isn't something you should read literally from the Batches count, but the presence of Disk Usage at all is the signal — this query is paying for temp file I/O on every run. The temp written=2051 buffer count at the top is driven by exactly this, and this is the dominant cost of the query.

Hash Join + Hash child — the second red flag. Buckets: 131072 Batches: 2 Memory Usage: 6822 kB. The hash table built from sim_bp_users needed about 13 MB (the build side is 200k rows at ~64 bytes each) and didn't fit in 4 MB of work_mem. When a hash join spills, PostgreSQL partitions both sides by the join key and processes one matched pair of partitions at a time — each probe row is tested only against its matching partition, not against every batch. The cost is the extra I/O of writing the build and probe sides to per-partition temp files and reading them back.

Seq Scan on sim_bp_orders. 175k rows returned, 324k removed by filter (total = 500k, the whole table). The filter is status IN ('pending', 'shipped'). No index on status, so the whole table is scanned.

Seq Scan on sim_bp_users. 200k rows returned, no filter — we need all users. Reads 2016 pages from disk (shared read=2016); the users table is mostly cold in cache.

The bottleneck order, from biggest to smallest: HashAggregate spill, Hash Join build-side batches, Seq Scans. Three different fixes are plausible, and which one is appropriate depends on how often this query runs, how much work_mem the rest of the workload can tolerate, and whether the data is append-mostly:

  • Raise work_mem per-session to ~20 MB so both the HashAggregate and the Hash Join stay in memory. Caveat: work_mem is allocated per sort/hash node per connection, so raising it globally multiplies by the number of concurrent queries doing sorts. Set it per-role (ALTER ROLE dashboard SET work_mem = '32MB') or per-session in the dashboard's connection pool, not cluster-wide.
  • Index sim_bp_orders.status so the scan becomes a Bitmap or Index Scan instead of reading all 500k rows. At ~35% selectivity a plain btree might not beat a seq scan by much, but a partial index or a multi-column (status, user_id) would.
  • Materialise the aggregate into a small summary table refreshed on a schedule or via triggers, if the query is a dashboard that runs every 10 seconds and the underlying data is append-mostly.
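The first two fixes in SQL form (the role name and index name are illustrative; `dashboard` echoes the per-role work_mem suggestion above):

```sql
-- Fix 1: scope the work_mem bump to the role that runs this query,
-- not the whole cluster -- work_mem is per sort/hash node per
-- connection, so a global bump multiplies across the workload.
ALTER ROLE dashboard SET work_mem = '32MB';

-- Fix 2: a partial index covering only the ~35% of rows that
-- match, keyed on the join column so the scan feeds the join
-- directly instead of reading all 500k rows.
CREATE INDEX idx_sim_bp_orders_open_uid
    ON sim_bp_orders (user_id)
    WHERE status IN ('pending', 'shipped');
```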

A fair DBA answer is "measure each fix in isolation and pick based on the workload" — not any specific prescribed order. If the query runs once a day in a reporting job, the work_mem bump is cheapest; if it runs constantly and powers a UI, the materialised result wins.

The five most common mistakes in reading plans

  1. Comparing rows without multiplying by loops. A node reporting rows=1 loops=50000 produced 50,000 rows. A node reporting rows=50000 loops=1 produced the same 50,000 rows in a very different shape. Always look at loops.

  2. Looking at top-line cost/time and calling it a day. The top-line number tells you the query is slow; it doesn't tell you which node is slow. Scan the tree for the node with the highest actual time × loops — that's where the time is spent, and usually where the fix is.

  3. Trusting the planner's estimates when actual rows disagrees. If rows=15 on the estimate and actual rows=8000, every downstream decision was built on the wrong premise. Don't try to understand why the plan is shaped the way it is until you've fixed the estimate (usually with ANALYZE or extended statistics).

  4. Missing the Rows Removed by Filter line. A Seq Scan returning a reasonable number of rows looks fine — until you notice the filter line says ten million rows were read and discarded to produce those few. The scan was fine; the cost is in the discard.

  5. Ignoring the Buffers line. Two plans can have identical shapes and wildly different performance if one hits cache and the other doesn't. shared hit=5 means "hot"; shared read=50000 means "the storage layer did all the work, and next time it might be even worse." The Buffers line is the only way to see this without looking at the timing.

Next steps

If the first plan in this article (the nested loop) looked straightforward and the second (the HashAggregate spill) made sense, you've mostly got it. The rest of the series digs into specific bottleneck categories — missing indexes, join-strategy mistakes, aggregate spills, non-sargable WHERE clauses — and what to do about each. The next piece is PostgreSQL Index Usage and Optimization.

#postgres #performance #database #sql

Originally published at https://mydba.dev/blog/postgres-explain-analyze-reading
