Automatic EXPLAIN Plan Collection: Stop Guessing Why Queries Are Slow

Philip McClarence

If you have ever scrambled to run EXPLAIN during a production incident, you already know the fundamental problem: by the time you are debugging, you have no baseline to compare against. The plan you are looking at right now might be perfectly normal, or it might be radically different from what it was yesterday. Without historical plans, you are guessing.

The Problem

Most teams never run EXPLAIN until something is on fire. A query degrades gradually over weeks as a table grows from 2 million to 20 million rows, the planner's cost estimates shift, and execution time creeps from 50ms to 500ms. Nobody notices because the change is incremental — until one day it crosses a threshold and response times spike.

The worst scenario is a sudden plan change. PostgreSQL runs ANALYZE, the statistics shift slightly, and the planner switches from an index scan to a hash join. A query that ran in 5ms now takes 250ms. You know something got slow, but you have no baseline plan to compare against because nobody was collecting them.

Manual EXPLAIN requires you to know which query to check and when to check it. You have hundreds of query templates running against dozens of tables, and the idea of manually running EXPLAIN on all of them every few hours is not realistic.

How to Detect It

You can manually run EXPLAIN on individual queries to inspect the execution plan:

-- Basic EXPLAIN shows the planner's chosen strategy
EXPLAIN (FORMAT JSON, VERBOSE)
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date > '2025-01-01'
  AND o.total_amount > 500;

-- EXPLAIN ANALYZE actually executes the query and shows real timing
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM events
WHERE event_type = 'purchase'
  AND created_at > now() - interval '7 days';

Look for these warning signs in the output:

  • Seq Scan on large tables where an index scan is expected
  • Nested Loop with high actual row counts (indicates bad cardinality estimates)
  • Hash Join with a build side larger than work_mem (spills to disk)
  • Sort with Sort Method: external merge (also spilling to disk)
  • Large gap between rows= estimates and actual rows (planner misjudging selectivity)

The gap in this approach is coverage. Running EXPLAIN manually tells you about one query at one point in time. To catch regressions, you would need to run EXPLAIN on every active query template periodically, store the results, and compare them over time.
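One way to build that candidate list is pg_stat_statements, if the extension is installed (the column names below assume PostgreSQL 13 or later, where timing was split into planning and execution columns):

```sql
-- Requires: CREATE EXTENSION pg_stat_statements;
-- and shared_preload_libraries = 'pg_stat_statements' in postgresql.conf
SELECT queryid,
       left(query, 80)                    AS query_template,
       calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 50;
```

The `queryid` gives you a stable handle for a normalized query template, which is what you would key plan history on.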

On PostgreSQL 16 and later, EXPLAIN (GENERIC_PLAN) generates a plan without executing the query and without needing specific parameter values. This makes automated collection safe — no risk of accidentally running a destructive query or waiting for a slow one to complete.
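For example, a parameterized template can be planned without supplying values and without executing anything (the `$n` placeholders stand in for runtime parameters; note that GENERIC_PLAN cannot be combined with ANALYZE):

```sql
-- PostgreSQL 16+: plan a parameterized query without executing it
EXPLAIN (GENERIC_PLAN, FORMAT JSON)
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date > $1
  AND o.total_amount > $2;
```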

Making EXPLAIN Collection Automated

The ideal approach is to collect EXPLAIN plans automatically on a schedule — say every 5 minutes for the top 50 slowest queries. On PostgreSQL 16+, GENERIC_PLAN makes this safe for any query, including write-heavy statements.

Once you have continuous plan collection, you gain several capabilities:

  • Performance grading: Each plan can be analyzed and assigned a grade from A (optimal) to F (critical issues), based on node types, buffer usage, row estimate accuracy, and spill-to-disk operations.
  • Historical comparison: When a query degrades, you can compare the current plan to the one from last week and see exactly what changed.
  • Proactive reviews: Scan all query grades weekly and catch degradations before they become incidents.

A monitoring tool that collects EXPLAIN plans automatically and grades them eliminates the coverage gap entirely. You are no longer relying on someone remembering to run EXPLAIN — every significant query has a plan history.
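A minimal sketch of such a pipeline, assuming PostgreSQL 16+, the pg_cron extension for scheduling, and a hypothetical plan_history table for storage:

```sql
-- Hypothetical storage for collected plans
CREATE TABLE IF NOT EXISTS plan_history (
    collected_at timestamptz NOT NULL DEFAULT now(),
    label        text        NOT NULL,
    plan         jsonb       NOT NULL
);

-- Capture a generic plan for one query template
CREATE OR REPLACE FUNCTION collect_plan(p_label text, p_sql text)
RETURNS void LANGUAGE plpgsql AS $fn$
DECLARE
    j json;
BEGIN
    -- GENERIC_PLAN needs no parameter values, so $n placeholders are fine
    EXECUTE 'EXPLAIN (GENERIC_PLAN, FORMAT JSON) ' || p_sql INTO j;
    INSERT INTO plan_history (label, plan) VALUES (p_label, j::jsonb);
END;
$fn$;

-- With pg_cron installed, snapshot the plan every 5 minutes.
-- A real collector would loop over the top-N queries from pg_stat_statements.
SELECT cron.schedule('collect-orders-plan', '*/5 * * * *',
    $$SELECT collect_plan('orders_by_date',
        'SELECT * FROM orders WHERE order_date > $1')$$);
```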

How to Fix It

When you find a query with a poor plan, the fix depends on what the plan reveals:

Large gap between estimated and actual rows (bad cardinality estimate):

-- Update statistics to give the planner better data
ANALYZE orders;

-- For columns with non-uniform distributions, increase statistics target
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;

Sequential scan where an index scan is expected:

-- Create the missing index
CREATE INDEX CONCURRENTLY idx_orders_date_amount
    ON orders (order_date, total_amount);

Hash join or sort spilling to disk:

-- Increase work_mem for the session (not globally)
SET work_mem = '256MB';
-- Then re-run the query and verify the plan changed

Understanding performance grades:

  • A-B: Plan is reasonable. The query is using indexes efficiently and estimates are accurate. No action needed.
  • C: Minor inefficiencies. Perhaps a slightly suboptimal join order or an index scan that could be an index-only scan. Worth investigating but not urgent.
  • D-F: Significant problems. Sequential scans on large tables, sort or hash operations spilling to disk, nested loops with high row counts. These queries are candidates for immediate optimization.
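As a rough illustration of how a grade can be derived from stored plan JSON (the plan_history table and thresholds here are hypothetical, and a real grader would recurse through the full node tree rather than inspect only the top-level node):

```sql
-- Crude grade from the top-level plan node; real tools walk every node
SELECT label,
       collected_at,
       CASE
           WHEN plan @> '[{"Plan": {"Node Type": "Seq Scan"}}]'          THEN 'D'
           WHEN (plan -> 0 -> 'Plan' ->> 'Total Cost')::numeric > 100000 THEN 'C'
           ELSE 'B'
       END AS grade
FROM plan_history
ORDER BY collected_at DESC;
```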

How to Prevent It

Make EXPLAIN plan collection automatic rather than reactive. Waiting for an incident to run EXPLAIN means you are always debugging without a baseline. Continuous plan collection gives you a historical record — when a query degrades, you can compare the current plan to the one from last week and see exactly what changed.

Review query grades weekly. A query that drops from B to D between deploys likely has a new access pattern or is hitting a table that grew past a planner threshold. Catching this during a routine review is vastly preferable to catching it during a 2am incident.

Set track_activity_query_size to at least 4096 (default is 1024) so that long queries are not truncated in pg_stat_activity. Truncated queries cannot be EXPLAINed, and most monitoring tools will silently lose that data.
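The setting can be applied cluster-wide; note that changing track_activity_query_size requires a full server restart, not just a configuration reload:

```sql
ALTER SYSTEM SET track_activity_query_size = 4096;
-- Takes effect only after a server restart; verify afterwards with:
SHOW track_activity_query_size;
```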
