Haleh

Posted on May 20

Stop Reading EXPLAIN Plans by Hand: Introducing pgexplain and pgwatch 🐘

#postgres #go #cli #database

PostgreSQL gives you a detailed execution plan for every query. Reading it is a skill — interpreting it correctly under pressure, at scale, or across dozens of slow queries is another thing entirely. Most developers either skip the plan entirely or paste it into an online visualizer and hope for the best.

The problem

You run a slow query. PostgreSQL hands you this:

[
  {
    "Plan": {
      "Node Type": "Hash Join",
      "Actual Rows": 923847,
      "Actual Loops": 1,
      "Plans": [
        { "Node Type": "Seq Scan", "Relation Name": "orders", "Actual Rows": 1000000 },
        { "Node Type": "Hash", "Batches": 8, "Memory Usage": 4096 }
      ]
    },
    "Execution Time": 1243.821
  }
]

Now what? The plan is telling you something. A sequential scan on a million-row table. A hash join that spilled across 8 batches. But connecting those dots to a concrete action — add an index here, raise work_mem there — takes experience that not everyone has, and time that no one has enough of.

pgexplain: automated plan analysis

pgexplain is a Go library and CLI that parses EXPLAIN (ANALYZE, FORMAT JSON) output and surfaces actionable findings.

$ pgexplain plan.json

[WARN]  sequential scan on "orders" discards 8332x more rows than it returns
  node:       Seq Scan (ID 1)
  detail:     PostgreSQL read 100000 rows from "orders" but only 12 matched
              (customer_id = 42) (8332 rows discarded per row returned).
  suggestion: Add an index on "orders" to support the filter (customer_id = 42).
              Run EXPLAIN (ANALYZE, BUFFERS) after adding the index to confirm it is used.

1 finding: 0 error(s), 1 warning(s), 0 info

Instead of staring at a wall of JSON, you get a ranked list of what's wrong and what to do about it.

Supported rules

Rule	What it catches
`SeqScan`	Sequential scan that discards far more rows than it returns
`RowEstimateMismatch`	Planner estimates off by 10× or more
`HashJoinSpill`	Hash joins that spill to disk
`NestedLoopLarge`	Nested loops with large outer input
`MissingIndexOnlyScan`	Heap fetches defeating an index-only scan
`SortSpill`	Sort operations that spill to disk
`TopNHeapsort`	LIMIT queries using slow heapsort
`ParallelNotLaunched`	Parallel plans where workers never started
`MergeJoinUnsortedInputs`	Merge Join with explicit Sort children
`HighTempBlockIO`	High temp block I/O from aggregations, window functions, CTEs

Use it in CI

pgexplain exits with code 1 if any Warn or Error findings are found, which makes it a drop-in CI gate:

psql -U myuser -d mydb \
  -c "EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM orders WHERE customer_id = 42" \
  | pgexplain || exit 1

Gate your pull requests on query plan quality, not just correctness.

Use it as a library

plan, _ := parser.Parse(explainJSON)

adv := advisor.New(
    rules.SeqScan(),
    rules.RowEstimateMismatch(),
    rules.HashJoinSpill(),
    // ...
)

for _, f := range adv.Analyze(plan) {
    fmt.Printf("[%s] %s\n  → %s\n", f.Severity, f.Message, f.Suggestion)
}

Embed it in your own slow query logger, migration runner, or developer CLI.

Install:

go install github.com/bright98/pgexplain/cmd/pgexplain@latest
# or as a library
go get github.com/bright98/pgexplain

pgwatch: continuous slow query monitoring

Catching bad plans during development is only half the battle. On a live server, slow queries happen continuously — and most teams only notice them after a user complains.

pgwatch is a daemon that tails your PostgreSQL log file, extracts the auto_explain plans that PostgreSQL writes for every slow query, and feeds them through pgexplain automatically.

PostgreSQL log file
      │
      ▼
pgwatch  ←─── tails & parses auto_explain JSON blocks
      │
      ▼
pgexplain rule engine  ←─── detects the real problems
      │
      ▼
terminal / JSON / HTML report

No database connection required. It's a pure log reader — it never executes EXPLAIN itself.

What the output looks like

=== pgwatch report — Wed, 13 May 2026 14:00:00 UTC ===

#1  2026-05-10 14:23:01 UTC  myuser@mydb  duration=1243.82ms  [auto_explain]
    [ERROR] node=1 (Hash Join) — hash batch spill to disk
             Inner side wrote 42 MB to temp files across 8 batches.
             → Increase work_mem or reduce the join input size.

#2  2026-05-10 14:31:44 UTC  myuser@mydb  duration=891.10ms  [auto_explain]
    [WARN] node=2 (Seq Scan) — sequential scan discards 5000x more rows than it returns
             → Add an index on "events" to support the filter (user_id = 99).

Quick start

Enable auto_explain in postgresql.conf:

shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 1000   # ms
auto_explain.log_format = json
auto_explain.log_analyze = on
log_line_prefix = '%m [%p] %q%u@%d '

Install and run:

go install github.com/bright98/pgwatch/cmd/pgwatch@latest

cp pgwatch.example.yaml pgwatch.yaml
# set log_file to your PostgreSQL log path

pgwatch run -c pgwatch.yaml      # daemon mode — flushes a report every hour
pgwatch report -c pgwatch.yaml   # one-shot — read the log once and exit

Output formats

pgwatch supports three output formats — terminal (default), json, and html. The HTML report is self-contained with no external dependencies: collapsible plan JSON, color-coded severity badges, and sortable findings.

How they fit together

Tool	What it does	When to use it
`pgexplain` (CLI)	Analyzes a single plan file or psql pipe	During development, in CI
`pgexplain` (library)	Embeds plan analysis into your own Go tool	Custom tooling, migration runners
`pgwatch`	Continuously monitors production slow query logs	Staging and production servers

Use pgexplain in development and CI to catch bad plans before they ship.
Use pgwatch in production to know which queries are hurting you right now, with suggestions attached.

Links

Both are written in Go, require no extensions beyond auto_explain (which ships with PostgreSQL), and are MIT licensed. Feedback and contributions are welcome.

DEV Community