DEV Community

Philip McClarence


How AI Turns a 27-Point EXPLAIN Plan Audit Into One-Click Query Fixes

You have a slow query. You run EXPLAIN ANALYZE. PostgreSQL hands you back a wall of text with terms like "Seq Scan," "Nested Loop," "Hash Join," "Bitmap Heap Scan," and "Materialize." Each node has estimated rows, actual rows, costs, loop counts, shared buffer hits, and disk reads. There are arrows pointing to child nodes. Some nodes have filters that removed rows. Others spilled to disk.

Most developers stare at this for thirty seconds, then close the tab.

That is a problem, because the EXPLAIN plan contains everything you need to fix the query. The information is there --- it is just buried in a tree structure that requires deep PostgreSQL internals knowledge to interpret. You need to know that a Nested Loop with a Seq Scan on the inner side means a missing index. You need to know that "Sort Method: external merge" means work_mem is too small. You need to know that a 100x mismatch between estimated and actual rows means the planner is making decisions based on stale statistics.

This article explains how mydba.dev automates this entire process: 27 rules that audit every node in the plan tree, followed by AI analysis that generates specific, copy-pasteable SQL fixes.

The 27 Insight Rules

Every EXPLAIN plan collected by mydba.dev runs through a rule-based analysis engine implemented as a PostgreSQL function. The engine walks the plan tree using a recursive CTE, inspecting each node for known performance anti-patterns.
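To make the tree walk concrete, here is an illustrative sketch only, not the actual mydba.dev engine: it assumes plans are stored as the JSONB output of `EXPLAIN (FORMAT JSON)` in a hypothetical `plans(plan jsonb)` table, and flattens every node with a recursive CTE so each one can be checked against rule thresholds:

```sql
-- Hypothetical schema: plans(plan jsonb) holding EXPLAIN (FORMAT JSON) output
WITH RECURSIVE plan_nodes AS (
    -- Root node: EXPLAIN (FORMAT JSON) wraps the plan in a one-element array
    SELECT plan->0->'Plan' AS node, 0 AS depth
    FROM plans
    UNION ALL
    -- Descend into each child under the "Plans" key
    SELECT child.value, depth + 1
    FROM plan_nodes,
         LATERAL jsonb_array_elements(node->'Plans') AS child(value)
    WHERE node ? 'Plans'
)
SELECT node->>'Node Type'              AS node_type,
       (node->>'Plan Rows')::bigint    AS estimated_rows,
       depth
FROM plan_nodes;
```

Each row this produces is one plan node; a rule like "Seq Scan with estimated rows over 10,000" becomes a simple WHERE clause over this result set.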

The rules fall into several categories.

Scan Problems (Rules 1, 8, 9, 11, 16, 23)

Rule 1 --- Sequential Scan on Large Table. The most common performance problem. PostgreSQL reads every row in the table, then applies a filter to discard most of them. The rule fires when estimated rows exceed 10,000 and the filter removes more rows than it keeps:

-- This query triggers Rule 1 on a users table with 2M rows
SELECT * FROM users WHERE last_login_at > '2026-01-01';

The plan shows Seq Scan on users with Rows Removed by Filter: 1,800,000. An index on last_login_at would turn this into an Index Scan that reads only the matching rows.
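The fix for this example would look something like the following (the index name is illustrative):

```sql
-- Satisfies the last_login_at range filter without a full table scan;
-- CONCURRENTLY avoids blocking writes while the index builds
CREATE INDEX CONCURRENTLY idx_users_last_login_at
ON users (last_login_at);
```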

Rule 16 --- Function Call Prevents Index Use. Even if you have a btree index on created_at, wrapping it in a function defeats the index entirely:

-- btree index on created_at is useless here
SELECT * FROM orders
WHERE EXTRACT(year FROM created_at) = 2026;

PostgreSQL cannot use a standard btree index when the column is inside a function call. The fix is an expression index: CREATE INDEX ON orders ((EXTRACT(year FROM created_at))).
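An alternative fix, often preferable to an expression index, is rewriting the predicate as a range so the existing plain btree index on `created_at` applies directly:

```sql
-- Equivalent to EXTRACT(year FROM created_at) = 2026,
-- but sargable against a plain btree index on created_at
SELECT * FROM orders
WHERE created_at >= '2026-01-01'
  AND created_at <  '2027-01-01';
```

The range form has the added benefit of working for any year without maintaining a dedicated expression index.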

Rule 9 --- Missing Index on Join Column. A Nested Loop where the inner side is a Seq Scan means the join column has no index. Each outer row triggers a full table scan of the inner table --- O(n*m) complexity.

Rule 23 --- Index Only Scan Needing Heap Access. When an Index Only Scan has a high heap fetch ratio, the visibility map is stale. The index has the data, but PostgreSQL must check the heap anyway. Fix: VACUUM the table.
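Using the `users` table from the earlier example as a stand-in, the fix and a quick check of vacuum state might look like this:

```sql
-- Refresh the visibility map so Index Only Scans can skip heap fetches
VACUUM (ANALYZE) users;

-- Inspect dead tuples and vacuum recency via the standard statistics view
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'users';
```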

Memory Pressure (Rules 3, 4, 6, 12, 21)

Rule 3 --- Sort Spills to Disk. When work_mem is too small for the sort, PostgreSQL writes intermediate results to temporary files. The plan shows Sort Method: external merge Disk:

-- Sorting 500K rows with default 4MB work_mem
SELECT * FROM events ORDER BY event_timestamp DESC;

The fix is straightforward: SET work_mem = '64MB'; for the session, or tune it globally if many queries spill.

Rule 4 --- Hash Spill to Disk. Hash Joins use an in-memory hash table. When it does not fit, PostgreSQL splits it into batches and writes them to disk. The plan shows Hash Batches: 16 instead of the ideal Batches: 1.

Rule 21 --- Hash Aggregate with Many Groups. A GROUP BY producing 100K+ distinct groups will likely overflow work_mem. PostgreSQL 13+ handles this with batched spills, but it is still significantly slower than in-memory aggregation.

Planner Accuracy (Rules 5, 14, 15, 26)

Rule 5 --- Row Estimate Mismatch. This is where performance tuning gets subtle. The planner estimated 100 rows, but the query actually returned 500,000. Every downstream decision --- join strategy, sort method, memory allocation --- was based on wrong data:

Seq Scan on orders (cost=0.00..35420.00 rows=100 width=64)
                    (actual time=0.015..245.320 rows=487293 loops=1)

A 4,800x mismatch. The fix: ANALYZE orders; to refresh table statistics. If the mismatch persists, you may need ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000; to increase the sample size for skewed columns.

Rule 26 --- Parallel Workers Not Launched. The planner decided parallelism would help (Workers Planned: 4) but no workers were actually available at runtime (Workers Launched: 0). Check max_parallel_workers --- other concurrent queries may have consumed the pool.
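The relevant limits can be inspected directly from the catalog:

```sql
-- Worker pool limits that determine whether planned workers can launch
SELECT name, setting
FROM pg_settings
WHERE name IN ('max_parallel_workers',
               'max_parallel_workers_per_gather',
               'max_worker_processes');
```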

Query Structure (Rules 7, 13, 17, 22, 25, 27)

Rule 7 --- Large OFFSET Pagination. OFFSET 50000 LIMIT 20 forces PostgreSQL to scan and discard 50,000 rows before returning 20. The deeper the page, the worse it gets:

-- Page 2500 of results --- scans 50K rows to return 20
SELECT * FROM products ORDER BY product_id OFFSET 50000 LIMIT 20;

-- Fix: keyset pagination
SELECT * FROM products
WHERE product_id > 98743  -- last seen ID from previous page
ORDER BY product_id LIMIT 20;

Rule 22 --- No Partition Pruning. If you have a table partitioned by month and your WHERE clause does not include the partition key, PostgreSQL scans all partitions. The plan shows an Append node with 15 Seq Scan children instead of 1.
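As a sketch, assuming an `events` table range-partitioned on `created_at` with one partition per month (the schema is illustrative):

```sql
-- Scans every partition: the partition key is absent from the WHERE clause
SELECT count(*) FROM events
WHERE user_id = 42;

-- Pruned to a single partition: the planner can eliminate all others
SELECT count(*) FROM events
WHERE user_id = 42
  AND created_at >= '2026-01-01'
  AND created_at <  '2026-02-01';
```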

Rule 17 --- Correlated Subquery. A SubPlan node that executes once per outer row. The plan shows SubPlan 1 with loops=47293 --- the subquery ran 47,293 times. Rewriting as a JOIN or lateral join eliminates the per-row execution.
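A typical rewrite, using hypothetical `users`/`orders` tables consistent with the examples in this article:

```sql
-- Correlated subquery: the SubPlan executes once per user row
SELECT u.user_id,
       (SELECT count(*) FROM orders o
        WHERE o.user_id = u.user_id) AS order_count
FROM users u;

-- Rewritten as an aggregated join: a single pass over orders
SELECT u.user_id, count(o.order_id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id
GROUP BY u.user_id;
```

The `LEFT JOIN` with `count(o.order_id)` preserves the original semantics: users with no orders still appear, with a count of zero.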

Rule 25 --- Recursive CTE with Excessive Iterations. A WorkTable Scan with 10,000+ loops indicates a recursive query that is traversing a large graph or missing a termination condition.
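One common guard is an explicit depth cap in the recursive term; this sketch uses a hypothetical `employees` hierarchy:

```sql
-- Cap recursion depth to guard against cycles or runaway traversal
WITH RECURSIVE subordinates AS (
    SELECT employee_id, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, s.depth + 1
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.employee_id
    WHERE s.depth < 10          -- explicit termination condition
)
SELECT * FROM subordinates;
```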

Join and Index Efficiency (Rules 2, 19, 20, 24)

Rule 2 --- Nested Loop with High Row Count. The inner side of a Nested Loop processes thousands of rows per loop across many loops. This is different from Rule 9 (which detects Seq Scan on the inner side) --- Rule 2 fires even when an index is used but returns too many rows per lookup.

Rule 19 --- Merge Join Sorting Both Inputs. A Merge Join requires sorted inputs. If neither side has an index providing sorted output, PostgreSQL adds Sort nodes to both sides. Two sorts are often more expensive than a single Hash Join.

Rule 20 --- Multiple Bitmap Indexes Combined. BitmapAnd or BitmapOr nodes indicate PostgreSQL is combining multiple single-column indexes. A single composite index covering all the filter conditions would be more efficient.
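For example, under an assumed `orders` schema with separate indexes on `status` and `created_at` that produce a BitmapAnd, the consolidation would be:

```sql
-- Replaces the BitmapAnd of two single-column indexes with one index
-- that serves the combined predicate in a single scan
CREATE INDEX idx_orders_status_created_at
ON orders (status, created_at);

-- This filter now resolves against the composite index directly
SELECT * FROM orders
WHERE status = 'pending'
  AND created_at > '2026-01-01';
```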

The remaining rules cover JIT compilation overhead (Rule 14), temporary file writes (Rule 12), lossy bitmap scans (Rule 6), large estimated sorts (Rule 18), and impossible query conditions (Rule 27).

How the AI Analysis Works

The 27 rules are deterministic --- they fire based on thresholds and node types. They tell you what is wrong. But they cannot tell you the specific SQL to fix it, because that requires understanding the query semantics, the table schema, the existing indexes, and the PostgreSQL configuration.

That is where the AI analysis comes in.

When you click "AI Analysis" in mydba.dev, the system assembles a context package containing:

1. The SQL query text --- the full, normalized query template.

2. Execution statistics --- from pg_stat_statements: total calls, mean/min/max execution time, rows returned, shared buffer hits vs disk reads, temp blocks written.

3. The full EXPLAIN plan tree --- every node with costs, row estimates, actual rows (if EXPLAIN ANALYZE), buffer usage, sort methods, and loop counts.

4. Rule-based insight findings --- which of the 27 rules fired and why, with affected tables, row counts, and score deductions.

5. Table schemas --- column definitions, data types, constraints, and row counts for every table referenced by the query.

6. Existing indexes --- all indexes on the referenced tables, including column lists, index types (btree, hash, GIN, GiST), and whether they are unique or partial.

7. PostgreSQL configuration --- the settings that affect query planning: work_mem, effective_cache_size, random_page_cost, shared_buffers, max_parallel_workers_per_gather, and the PostgreSQL version.

8. Plan history --- previous EXPLAIN plans for the same query template, enabling regression detection ("this query used an Index Scan last week but switched to Seq Scan today").

This context is sent to an AI model (your choice of OpenAI, Anthropic, or the built-in provider), which returns a structured JSON response with prioritized recommendations, each including the exact SQL to execute, the expected improvement factor, and a risk assessment.

[Screenshot: Plan tree and insight analysis in mydba.dev]

Example: The NOT IN Anti-Join

Consider this query that finds users who have not placed any completed orders:

SELECT u.user_id, u.email, u.username
FROM users u
WHERE u.user_id NOT IN (
    SELECT o.user_id FROM orders o WHERE o.status = 'completed'
);

The rule engine scores this D (65/100) and fires three insights:

  • Rule 1 (Sequential Scan on Large Table): Seq Scan on users with 2M rows. The filter removes 1.8M rows.
  • Rule 17 (Correlated Subquery): The NOT IN subquery materializes the full result set, then checks every user row against it. With NULL handling, PostgreSQL cannot optimize this into a simple anti-join.
  • Rule 5 (Row Estimate Mismatch): The planner estimated 100 matching rows; the actual result was 47,293.

The AI analysis examines the context and returns two recommendations:

Recommendation 1 --- Query Rewrite (Critical). Rewrite NOT IN as a LEFT JOIN ... IS NULL or NOT EXISTS pattern:

-- Option A: LEFT JOIN anti-join
SELECT u.user_id, u.email, u.username
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id AND o.status = 'completed'
WHERE o.user_id IS NULL;

-- Option B: NOT EXISTS (often preferred by the planner)
SELECT u.user_id, u.email, u.username
FROM users u
WHERE NOT EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.user_id AND o.status = 'completed'
);

The AI explains why: NOT IN has problematic NULL semantics. If any orders.user_id is NULL, the NOT IN predicate can never evaluate to true, so the query returns zero rows. PostgreSQL must account for this possibility, which prevents it from using an efficient anti-join strategy. NOT EXISTS and LEFT JOIN ... IS NULL do not have this problem.

Recommendation 2 --- Index (High). Add a composite index:

CREATE INDEX idx_orders_user_id_status
ON orders (user_id, status);

This covers both the join column and the filter predicate, enabling an Index Only Scan on the orders side of the anti-join.

Expected improvement: 10-50x, depending on table sizes and cache state.

Example: The Missing Join Index

Here is a simpler case --- a product catalog query:

SELECT p.product_name, oi.quantity
FROM products p
JOIN order_items oi ON oi.product_id = p.product_id
WHERE p.category = 'electronics';

The plan shows a Nested Loop: Index Scan on products (using an index on category) as the outer side, and a Seq Scan on order_items (1.2M rows) as the inner side. For each of the 3,400 electronics products, PostgreSQL scans all 1.2 million order items. That is 4 billion row comparisons.

The rule engine fires:

  • Rule 9 (Missing Index on Join Column): Nested Loop performs sequential scan on order_items (1,200,000 rows) for each outer row. Score deduction: 20 points.
  • Rule 1 (Sequential Scan on Large Table): Seq Scan on order_items with filter removing 1,196,000 rows per loop.

The AI recommendation is a single index:

CREATE INDEX idx_order_items_product_id
ON order_items (product_id);

With this index, the Nested Loop switches from Seq Scan to Index Scan on the inner side. Each outer row triggers a fast B-tree lookup instead of a full table scan. The plan cost drops from 4,000,000 to about 12,000 --- a 300x improvement.

The AI also notes that a covering index including quantity would enable Index Only Scans:

CREATE INDEX idx_order_items_product_id_quantity
ON order_items (product_id) INCLUDE (quantity);

[Screenshot: AI recommendations in mydba.dev]

Why Rules + AI, Not Just AI

A reasonable question: why not skip the rules and send the raw EXPLAIN plan directly to an AI model?

Three reasons.

Determinism. The rules produce the same result every time for the same plan. They are unit-testable. They do not hallucinate. If Rule 3 says the sort spilled to disk, it did --- because the plan literally says Sort Method: external merge Disk.

Speed. The rule engine runs as a PostgreSQL function in under 10ms. It executes on every collected plan, every 5 minutes, for every monitored query. Sending every plan to an AI API would be expensive and slow.

Focus. The AI gets a pre-filtered context. Instead of "here is a plan, find problems," it gets "here are three specific problems we already identified, plus the full context --- now tell me exactly how to fix them." The recommendations are more specific because the AI is not guessing what might be wrong.

The rules tell you what is wrong. The AI tells you how to fix it. Together, they turn a 300-line EXPLAIN plan into a prioritized list of SQL statements you can copy, paste, and run.

Try It

mydba.dev monitors your PostgreSQL databases continuously, collecting EXPLAIN plans for your top queries automatically. Every plan runs through the 27-point audit. When you are ready for specific fixes, one click sends the full context to AI analysis.

The rule engine and plan collection are included in the free tier. AI analysis works with your own OpenAI or Anthropic API key, or the built-in provider.
