The query plan is only as good as the statistics behind it. When those statistics are wrong, the planner makes confident decisions based on a false reality.
We run a field technician dispatch system on PostgreSQL 14. The core query — find technicians matching specific dispatch criteria near a job site — ran in 3ms at 10am and 42ms at 2pm the same day. Same query text. Same bind parameters. The only thing that changed was the number of technicians available — as the fleet clocked in and dispatched through the day, the underlying data distribution shifted just enough to flip the query plan.
This post is the story of how we traced a 14x latency regression to a fundamental assumption baked into every cost-based query optimizer, and why our existing composite index didn't help. Along the way, we'll dig into pg_statistic, selectivity estimation, BitmapAnd mechanics, and the specific ways correlated boolean columns break the planner's world model.
But first — if you've never looked at how PostgreSQL decides whether to use your index, let's build up from first principles.
How PostgreSQL Chooses a Query Plan
PostgreSQL doesn't just "use the index." It evaluates multiple execution strategies and picks the one with the lowest estimated cost. This is the cost-based optimizer.
The optimizer doesn't see your data. It sees a statistical summary stored in pg_statistic (exposed via the pg_stats view). When ANALYZE runs — manually or via autovacuum — PostgreSQL samples rows and builds per-column statistics: most common values (MCVs), histograms, distinct counts, and null fractions.
For a single predicate like WHERE is_available = true, the planner looks up the MCV frequency for is_available. If true appears with frequency 0.50, the estimated row count on a 15,000-row table is 7,500.
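Everything in that lookup is visible from SQL. A quick way to inspect the planner's model of the flag columns on our technicians table:

```sql
-- Per-column statistics exactly as the planner sees them
SELECT attname, null_frac, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'technicians'
  AND attname LIKE 'is\_%';
```

most_common_freqs holds the fractions the planner multiplies; for a NOT NULL boolean, the two MCV entries sum to 1.0.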
For multiple predicates, it applies the independence assumption:
P(A AND B AND C) = P(A) × P(B) × P(C)
This is correct if and only if the columns are statistically independent. The PostgreSQL docs are explicit:
"The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when column values are correlated."
— Chapter 14.2: Statistics Used by the Planner
When the assumption holds, estimates are accurate. When it doesn't, they can be off by orders of magnitude. Our query hit the second case.
The Setup: Our Table and Query
How we ended up with six booleans
The technicians table tracks every field technician in the system — roughly 15,000 rows. It didn't start with six boolean columns. It started with two: is_active and is_available. Then we needed to track whether a technician was currently on a job, so is_dispatched arrived. A compliance incident led to is_blocked. Express delivery became a feature, so is_express_enabled. An ops request for soft-disabling technicians without removing them added is_suspended. Each boolean made sense in isolation, each was a small migration, and each shipped independently over about eighteen months.
CREATE TABLE technicians (
id BIGSERIAL PRIMARY KEY,
zone_id INTEGER NOT NULL,
is_active BOOLEAN NOT NULL DEFAULT false,
is_available BOOLEAN NOT NULL DEFAULT false,
is_dispatched BOOLEAN NOT NULL DEFAULT false,
is_blocked BOOLEAN NOT NULL DEFAULT false,
is_express_enabled BOOLEAN NOT NULL DEFAULT false,
is_suspended BOOLEAN NOT NULL DEFAULT false,
last_job_location GEOMETRY(Point, 4326),
-- ... other columns
);
The technician_locations table stores real-time GPS positions (one row per technician, updated every few seconds).
Key indexes
-- Composite boolean index WITH zone
CREATE INDEX idx_technician_selection_v2 ON technicians
(zone_id, is_active, is_available, is_dispatched, is_blocked, is_express_enabled, is_suspended);
-- Composite boolean index WITHOUT zone
CREATE INDEX idx_technician_selection ON technicians
(is_active, is_available, is_dispatched, is_blocked, is_express_enabled, is_suspended);
-- Spatial index on last job location
CREATE INDEX idx_technician_last_job_geom ON technicians USING gist (last_job_location);
-- Spatial index on live GPS location
CREATE INDEX idx_technician_location_geom ON technician_locations USING gist (current_location);
The allocation query
The query finds technicians matching specific dispatch criteria within a geographic radius:
SELECT t.*, tl.*, ts.*
FROM technicians t
INNER JOIN technician_locations tl ON t.id = tl.technician_id
LEFT JOIN technician_stats ts ON t.id = ts.technician_id
LEFT JOIN daily_utilization du ON t.id = du.technician_id
-- ... additional left joins for tagging, blocklists, queue state
WHERE t.is_active = true
AND t.is_available = true
AND t.is_dispatched = true
AND t.is_blocked = false
AND t.is_express_enabled = true
AND t.is_suspended = false
AND ST_DWithin(tl.current_location, ST_SetSRID(ST_MakePoint(?, ?), 4326)::geography, 4000)
AND tl.reported_at > now() - interval '20 minutes';
Six boolean conditions on technicians, plus a spatial predicate on technician_locations. Looks straightforward. The problem is entirely in how the planner estimates the boolean combination.
The Symptom: Same Query, Two Plans
We noticed the issue in our Grafana dashboards. The allocation query's P50 latency was steady around 4ms, but P99 kept spiking to 40–45ms within the same day. The pattern correlated with fleet activity — as more technicians clocked in and changed dispatch states through the day, the underlying data distribution shifted. The planner's statistics drifted with it, and the plan flipped.
When we pulled EXPLAIN ANALYZE hours apart on the same day, we saw two completely different plans.
The fast plan (~3ms): BitmapAnd
Bitmap Heap Scan on technicians t  (actual rows=1 loops=1)
  Filter: ST_DWithin(last_job_location, ..., 4000)
  Rows Removed by Filter: 92
  Heap Blocks: exact=191
  ->  BitmapAnd  (cost=146.22..146.22 rows=4 width=0) (actual time=1.83..1.83 rows=0 loops=1)
        ->  Bitmap Index Scan on idx_technician_selection
              (cost=0.00..5.22 rows=38 width=0) (actual time=0.45..0.45 rows=8720 loops=1)
              Index Cond: (is_active = true) AND (is_available = true) AND (is_dispatched = true)
                          AND (is_blocked = false) AND (is_express_enabled = true) AND (is_suspended = false)
        ->  Bitmap Index Scan on idx_technician_last_job_geom
              (cost=0.00..140.75 rows=1628 width=0) (actual time=1.20..1.20 rows=8419 loops=1)
              Index Cond: (last_job_location && <bounding box>)
Execution Time: 3.155 ms
The planner used BitmapAnd — it scanned both the boolean index and the spatial index on technicians, intersected the two bitmaps in memory, and only fetched 191 heap pages. One row survived all filters. Fast.
The slow plan (~42ms): Boolean index only
Nested Loop  (actual loops=1)
  ->  Bitmap Heap Scan on technicians t
        (cost=5.47..1205.33 rows=38 width=824) (actual time=0.62..8.14 rows=4626 loops=1)
        ->  Bitmap Index Scan on idx_technician_selection
              (cost=0.00..5.22 rows=38 width=0) (actual time=0.48..0.48 rows=4626 loops=1)
              Index Cond: (is_active = true) AND (is_available = true) AND (is_dispatched = true)
                          AND (is_blocked = false) AND (is_express_enabled = true) AND (is_suspended = false)
  ->  Index Scan on technician_locations tl  (actual loops=4558)
        Index Cond: (technician_id = t.id)
        Filter: ST_DWithin(current_location, ..., 4000)
Execution Time: 42.008 ms
The planner used only the boolean index. No spatial index. It fetched 4,626 rows from technicians, then nested-loop joined into technician_locations 4,558 times, applying ST_DWithin as a CPU filter on each loop. 14x slower.
Side by side
| Metric | Fast Plan (BitmapAnd) | Slow Plan (Boolean only) |
|---|---|---|
| Strategy | BitmapAnd (boolean + spatial) | Bitmap Scan (boolean only) |
| Estimated rows from boolean index | 38 | 38 |
| Actual rows from boolean index | 8,720 | 4,626 |
| Heap access | 191 pages via bitmap | 4,626 rows, then 4,558 nested-loop probes |
| Spatial index used? | Yes | No |
| Execution time | 3.1ms | 42ms |
A few things jump out. The estimated row count is 38 in both plans. That's the planner's selectivity estimate for the boolean combination. But the actual count is thousands of rows. The estimate is wrong by two orders of magnitude.
The difference between the plans isn't that one has better estimates — both are equally wrong. The difference is what the planner decided to do with that bad estimate.
The Root Cause: 229x Underestimate
The selectivity math
The planner looks up each boolean column's frequency independently from pg_statistic and multiplies them together:
is_active = true: ~70% → 0.70
is_available = true: ~50% → 0.50
is_dispatched = true: ~2% → 0.02
is_blocked = false: ~95% → 0.95
is_express_enabled = true: ~40% → 0.40
is_suspended = false: ~95% → 0.95
Combined (assuming independence):
0.70 × 0.50 × 0.02 × 0.95 × 0.40 × 0.95 ≈ 0.0025
15,000 rows × 0.0025 ≈ 38 rows
The planner arrives at 38. The actual number at query time is 8,720. That's a 229x underestimate.
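You can reproduce the misestimate in isolation, with no joins in the way, by explaining a bare count over the six flags. The numbers in the comment are the ones from this incident, not a fresh run:

```sql
EXPLAIN ANALYZE
SELECT count(*)
FROM technicians
WHERE is_active = true
  AND is_available = true
  AND is_dispatched = true
  AND is_blocked = false
  AND is_express_enabled = true
  AND is_suspended = false;
-- The index scan line shows the planner's estimate (rows=38 for us)
-- directly next to the actual count (rows=8720)
```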
Why the multiplication is wrong
These columns are not independent. They encode a finite state machine — the operational lifecycle of a field technician. The business logic enforces hard constraints between them:
| Column A | Column B | Relationship |
|---|---|---|
| is_active | is_available | Available implies active |
| is_active | is_dispatched | Dispatched implies active |
| is_available | is_dispatched | Dispatched implies available |
| is_blocked | is_active | Blocked implies not active |
| is_suspended | is_active | Suspended implies not active |
| is_suspended | is_available | Suspended implies not available |
In practice, 5 of the 6 booleans encode roughly 6 valid states (the sixth, is_express_enabled, is an independent capability flag):
| State | is_active | is_available | is_dispatched | is_blocked | is_suspended |
|---|---|---|---|---|---|
| SUSPENDED | F | F | F | F | T |
| BLOCKED | F | F | F | T | F |
| INACTIVE | F | F | F | F | F |
| OFFLINE | T | F | F | F | F |
| IDLE | T | T | F | F | F |
| DISPATCHED | T | T | T | F | F |
That's 6 valid states out of 2^5 = 32 theoretical combinations. The planner's independence math spreads probability mass across all 32 combinations as if each could occur. It doesn't know that is_blocked = true AND is_active = true can never occur in practice.
Actual selectivity of the boolean combination: ~58% (8,720 / 15,000)
Planner's estimate: 0.25% (38 / 15,000)
Error factor: 229x
The PostgreSQL docs even demonstrate this exact failure mode:
"The planner estimates the selectivity for each condition individually... Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher."
— Multivariate Statistics Examples
Why the Plan Flips: ANALYZE Sampling Non-Determinism
The 229x underestimate explains why the slow plan is slow. But why does the query sometimes get the fast plan?
Because the estimate isn't always exactly 38 — it drifts. ANALYZE takes a random sample, not a census:
"The statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run."
— ANALYZE command reference
Our technicians table is high-churn — technicians toggle is_dispatched hundreds of times per day across the fleet. As the number of available technicians changes through the day, each ANALYZE sample captures a different snapshot. The boolean estimate drifts — and the plan is sensitive to that drift.
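One way to line plan flips up against ANALYZE runs is the standard statistics views:

```sql
-- How recently were stats rebuilt, and how much churn since?
SELECT relname,
       last_analyze,
       last_autoanalyze,
       autoanalyze_count,
       n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'technicians';
```

Every increment of autoanalyze_count is a new random sample, and therefore a new chance for the boolean estimate to drift across the plan boundary.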
The tipping point
The planner's decision to include the spatial index in a BitmapAnd has a cost threshold. When the boolean estimate is "moderate" (say, 30–50 rows), the planner decides it's worth adding a second index scan to intersect. When the estimate drops below a threshold (say, 8–15 rows), the planner decides the boolean index alone is good enough and drops the spatial index.
When the boolean estimate is ~38:
Planner: "Boolean index returns ~38 rows. That's moderate.
Adding the spatial index scan (est. 1,628 rows) and
intersecting bitmaps will reduce heap fetches to ~4.
The extra index scan is worth it."
Plan: BitmapAnd(boolean_index + spatial_index)
Result: 191 heap pages, 1 row survives → 3ms
When the boolean estimate drops to ~10:
Planner: "Boolean index returns only ~10 rows. That's already tiny.
A second index scan costs more than just filtering 10 rows
by distance. Not worth the overhead."
Plan: Bitmap Scan on boolean index only, ST_DWithin as CPU filter
Result: 8,720 rows scanned, spatial filter on every row → 42ms
The perverse incentive
Here's the cruel part: the MORE wrong the boolean estimate is (in the "too low" direction), the MORE confident the planner becomes that BitmapAnd is unnecessary — and the WORSE the actual performance gets. The planner drops the spatial index precisely when it would help the most.
Autovacuum is the trigger
Autovacuum triggers ANALYZE when the number of changed tuples exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor × table_size. With default settings, that's 50 + 0.10 × 15,000 = 1,550 tuple changes. On a table with thousands of state transitions per minute during peak hours, this threshold is crossed constantly. ANALYZE can fire multiple times within a few minutes, each time producing a slightly different boolean estimate that may or may not cross the BitmapAnd decision boundary.
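If you need to damp the flapping while working on a real fix, per-table storage parameters are one lever. This is a mitigation sketch, not the fix we're building toward, and the values are illustrative:

```sql
-- Re-analyze less eagerly than the 10% default
ALTER TABLE technicians SET (autovacuum_analyze_scale_factor = 0.5);

-- Take a larger sample for the most volatile column
-- (the default statistics target is 100)
ALTER TABLE technicians ALTER COLUMN is_dispatched SET STATISTICS 1000;

ANALYZE technicians;
```

Neither setting touches the independence assumption; a bigger, fresher sample just makes the wrong multiplication more stable.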
BitmapAnd: How It Works and Why It's Fragile
BitmapAnd scans multiple indexes on the same table, builds a bitmap of matching heap page locations for each, and intersects them. Only pages in the intersection get fetched. In our fast plan, it intersects boolean-matching pages with spatial-matching pages, reducing thousands of candidates down to 191 heap pages.
Two things make it fragile for our case:
Same-table only. BitmapAnd can't combine indexes across different tables. Our spatial filter is on technician_locations, our boolean filter is on technicians — that's a join, not a merge. BitmapAnd only helps when both indexes live on the same table (like the last_job_location GiST index and the boolean index, both on technicians).
Cost-sensitive inclusion. Each additional bitmap index scan has a startup cost. When the boolean estimate is low enough, the planner decides the second index scan isn't worth it:
"Because each additional index scan adds extra time, the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well."
— Chapter 11.5: Combining Multiple Indexes
The planner concludes: "I only expect 10 rows from the boolean index — filtering them by distance is cheaper than running a second index scan." The math is correct given the premise. The premise is just 229x wrong.
Why Our Composite Index Didn't Help
We already had a composite index on all six boolean columns. The natural assumption was: "PostgreSQL has an index on exactly this combination — surely it knows how many rows match?"
It doesn't. A composite index helps PostgreSQL find rows efficiently (access path), but it does NOT help PostgreSQL estimate how many rows exist before scanning (cardinality estimation). These are two separate systems.
"No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column."
— pg_statistic catalog
ANALYZE samples heap pages and builds per-column MCVs independently. The B-tree's internal knowledge of which key combinations exist is never extracted into planner statistics. Our composite index found the right rows perfectly every time — the problem was that the planner's estimate of how many rows it would find was 229x too low, causing it to choose a bad join strategy around the index.
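You can verify that the index is invisible to the estimator. ANALYZE writes statistics rows only for table columns and expression-index columns, so a plain composite index contributes nothing:

```sql
-- Plain (non-expression) index columns get no statistics of their own
SELECT count(*)
FROM pg_stats
WHERE tablename = 'idx_technician_selection';
```

For an index like ours, the count is 0; only an expression index would show up here.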
Only two things can store combination frequencies:
- CREATE STATISTICS (mcv) — explicit opt-in, stores multi-column MCV lists
- A single enum/status column — collapses the combination into one value
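The first option maps directly onto our schema. A minimal sketch, assuming PostgreSQL 12+ for the mcv statistics kind (the statistics object name here is ours):

```sql
-- Teach the planner the joint distribution of the six flags
CREATE STATISTICS technician_flags_mcv (mcv)
  ON is_active, is_available, is_dispatched, is_blocked,
     is_express_enabled, is_suspended
  FROM technicians;

-- Extended statistics are only populated on the next ANALYZE
ANALYZE technicians;
```

With the multivariate MCV list in place, the planner estimates equality clauses on these columns from observed combination frequencies instead of multiplying six marginals.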
The Anti-Pattern: Dependent Booleans in OLTP
This isn't just a query tuning story. It's a schema design lesson.
When you model a state machine as independent boolean columns, you're making an implicit promise to the database: "these columns are independent dimensions." Every cost-based optimizer — PostgreSQL, MySQL, Oracle, SQL Server — takes you at your word. The planner multiplies their selectivities because that's mathematically correct for independent variables.
The problem is that the promise is false. is_available and is_dispatched aren't independent dimensions — they're states in a lifecycle. One implies the other. The planner can't know this from the schema alone.
This pattern tends to emerge organically. You start with is_active. A feature ships, you add is_available. A compliance requirement adds is_blocked. Each column is a small, low-risk migration. Nobody notices that the columns are accumulating mutual dependencies until the planner starts making bad decisions — and even then, the symptom (intermittent latency spikes) doesn't obviously point at schema design.
The fix is to model state as state. If your boolean columns have business-logic dependencies between them — if certain combinations can never occur — they should be a single enum or status column. One column, one MCV lookup, no multiplication error. WHERE status = 'idle' gives the planner an exact frequency. WHERE is_active = true AND is_available = true AND is_dispatched = false AND ... gives it a guess.
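A sketch of what that migration could look like; the type, column, and state names are illustrative, mirroring the state table above:

```sql
-- One state machine, one column
CREATE TYPE technician_status AS ENUM
  ('suspended', 'blocked', 'inactive', 'offline', 'idle', 'dispatched');

ALTER TABLE technicians
  ADD COLUMN status technician_status NOT NULL DEFAULT 'inactive';

-- is_express_enabled stays a separate boolean:
-- it really is an independent dimension
```

WHERE status = 'idle' AND is_express_enabled = true multiplies exactly two frequencies, and with only six possible states the status column's MCV list is essentially complete.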
A note on InnoDB
Interestingly, this specific failure mode wouldn't manifest the same way on MySQL/InnoDB with the same composite index. InnoDB's optimizer uses a technique called index dive — when estimating the cardinality of a range scan on a composite index, it actually samples the B-tree directly rather than multiplying per-column statistics. For an equality scan across all columns of a composite index (which is what our query does), InnoDB dives into the index, reads a sample of pages at the leaf level, and estimates row count from the actual index structure.
This means InnoDB would see that the combination (true, true, true, false, true, false) maps to ~8,700 leaf entries, not 38. The estimate would be roughly correct, and the optimizer wouldn't make the same bad join order decision.
PostgreSQL doesn't do index dives for cardinality estimation. It always goes back to pg_statistic and multiplies. The composite index is invisible to the estimation layer — it's only visible to the execution layer. This is a deliberate design choice (keeping statistics separate from access paths), but it means PostgreSQL is more vulnerable to correlated-column estimation errors than InnoDB is, even when the right composite index already exists.
Wrapping Up
- Always compare estimated vs. actual rows in EXPLAIN ANALYZE. If they diverge by more than 10x, the planner is making decisions based on a false premise.
- Composite indexes don't fix estimation in PostgreSQL. They help the executor find rows, but pg_statistic stores per-column statistics independently. The planner still multiplies. (InnoDB's index dives would handle this better.)
- Model state as state, not as independent booleans. If your boolean columns have mutual dependencies — if certain combinations can never occur — they belong in a single enum column.
PostgreSQL's cost-based optimizer is remarkably good. But it operates on a statistical model of your data, not the data itself. When your schema encodes assumptions that violate the model's assumptions, the planner makes rational decisions from irrational premises. Understanding where the model breaks is the difference between a query that runs in 3ms and one that runs in 42ms.