Michal Cyncynatus

When ANALYZE Isn't Enough: Debugging Bad Row Estimation in PostgreSQL

As backend developers, we’ve all been there: a query that usually runs instantly suddenly hangs, or a specific endpoint times out in production. You dive into the investigation, and the first step is almost always running the query with EXPLAIN to see the execution plan.

If you use a visualization tool (like Dalibo’s plan viewer or Explain Depesz), you might spot a red flag staring back at you: "Bad Row Estimation."

Consider this query, which filters items for a specific organization:

SELECT *
FROM items
WHERE organization_id = 102;

Dalibo visualizes the plan and row estimation like this:

[Plan visualization: the planner overestimates the rows returned when filtering by organization_id]

At first glance, it looks good: it uses an index and appears to be a fast query. But looking closer, you can see the planner expected to find nearly 34K rows but found only 365. Why does this discrepancy exist, and why should you care?
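
If you aren't using a visualizer, EXPLAIN ANALYZE shows the same information in plain text; compare the planner's estimated rows= value with the actual rows reported for each node:

EXPLAIN ANALYZE
SELECT *
FROM items
WHERE organization_id = 102;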

Why Row Estimation Matters

The Postgres Planner’s job is to calculate the cost of various execution paths and select the cheapest one. These calculations rely entirely on how many rows the planner expects to process.

When row estimation is wrong, the planner optimizes for a scenario that doesn't exist.

How It Breaks: Join Strategies

Consider a query joining items and item_types. The planner must decide how to join these tables, usually choosing between:

  • Nested Loop: Extremely fast if one side of the join is small (e.g., 10 rows).
  • Hash Join: More efficient for large datasets (e.g., 1,000,000 rows) but requires building a hash table in memory.

The Problem:
If the planner estimates a user has only 10 items (when they actually have 1,000,000), it will choose a Nested Loop. Instead of a quick Hash Join, the database effectively loops a million times, turning a millisecond query into one that hangs for minutes.
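
One quick way to sanity-check a suspected wrong join choice (a debugging sketch only; the join column below is assumed for illustration) is to disable nested loops for your session and compare the plans:

SET enable_nestloop = off;
EXPLAIN ANALYZE
SELECT *
FROM items
JOIN item_types ON item_types.id = items.item_type_id;
RESET enable_nestloop;

If the forced hash join is dramatically faster, bad row estimation is the likely culprit.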

Beyond joins, faulty estimation affects other planner decisions made prior to running the query, such as:

  • Parallel Workers: The planner may decide the dataset is too small to justify launching parallel workers, leaving CPU cores idle while a single core struggles.
  • Index usage: It might choose to perform a sequential scan if it assumes there are many records, effectively ignoring a perfectly good index.

The bottom line: row estimation dictates planner decisions. Bad row estimation causes bad decisions.

The Standard Fix: ANALYZE

When a developer sees "Bad Row Estimation," the immediate reaction (and the first-aid advice you’ll get from ChatGPT or StackOverflow) is simple: Run ANALYZE.

ANALYZE samples the table data and updates the statistics stored in pg_statistic. It refreshes the planner's knowledge of:

  • Null fraction: How many rows are null?
  • Distinct values: How many unique items are there?
  • Most Common Values (MCV): Which values appear most often?
  • Histogram bounds: How is the data distributed?

Usually, this works. You run ANALYZE table_name, the stats update, the estimate becomes accurate, and the plan is fixed.
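
For example, you can refresh and then inspect those statistics directly (the column name here is just illustrative):

ANALYZE items;

SELECT null_frac, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'items'
  AND attname = 'organization_id';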

But what happens when you run ANALYZE, and the estimation is still wrong?

This is where we leave the realm of basic administration and enter the internals of the Postgres Planner. Here are three common scenarios where ANALYZE falls short, and how to fix them.


Case 1: The Outliers (Data Skew)

Imagine an items table with a created_at column, where most of your data is historical or current.
When you run the following query to fetch all the items created on December 30, 2025, you get a stable, expected plan:

SELECT *
FROM items
WHERE created_at BETWEEN '2025-12-30' AND '2025-12-31';

[Plan visualization: an accurate estimate for the date-range query, before any future-dated records exist]

However, at some point, a developer writes a test script that inserts a single record with a date in the year 2099. Suddenly, you get this plan:

[Plan visualization: the same query after a record dated 2099 is inserted, now badly underestimated]

The result? The planner estimated 4 rows.
The reality? 21K rows found.

The Issue

Postgres uses a histogram to model data distribution. It takes the column's minimum and maximum values, then divides the range between them into buckets.

Postgres creates histogram buckets that try to hold an equal amount of data. For simplicity, let's assume it divided the 2025 data into 12 buckets. Each month gets a bucket and holds 8.33% of the data.

After adding the 2099 record, the final bucket's range changes from December 1-31, 2025 to December 1, 2025 through January 1, 2099.

This single bucket now spans 74 years but contains the same amount of data (in our example, 8.33%) as the buckets covering the other months of 2025. When you query for a date like '2025-12-30', it falls into this massive final bucket. The planner assumes data within a bucket is spread uniformly, so it thinks that 8.33% of the data is sprinkled thinly across 74 years, rather than being clustered in December 2025. This causes the massive underestimation.
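
You can often spot this kind of skew directly (using the table and column from this example): the histogram's upper bound will sit far beyond the bulk of the data.

SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'items'
  AND attname = 'created_at';

-- Or simply check whether the maximum value is suspiciously far in the future:
SELECT max(created_at) FROM items;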

Now that we understand the root cause of the underestimation, let's see what we can do to fix it.

The Solution

  1. Clean your data: Remove the outliers (the 2099 records) if they are invalid. This might sound obvious, but sometimes we just don't know we have these outliers. If the outliers come from user input, consider being more defensive and adding input validation to prevent them, provided it fits your business logic.
  2. Increase Statistics Target: You can tell Postgres to create more "buckets" for the histogram to get a higher-resolution view of the data.

    ALTER TABLE items ALTER COLUMN created_at SET STATISTICS 1000; -- Default is 100
    ANALYZE items;
    

    You can inspect the refreshed, higher-resolution statistics by running:

    SELECT tablename, attname, n_distinct, histogram_bounds 
    FROM pg_stats 
    WHERE tablename = 'items' 
      AND attname = 'created_at';
    

Why does this work?

  • At 100 buckets (Low Res): Each bucket holds 1% of your data. The final bucket starts in late 2025 and ends at 2099. That outlier "poisons" the entire 1% chunk, causing Postgres to smear that data thinly across 74 years.
  • At 1000 buckets (High Res): Each bucket holds only 0.1% of your data. Now, the "bridge" to 2099 is contained entirely within the top 0.1% bucket. We poison less data, and the chance of getting a wrong estimation decreases.

Note: This change won't always help; it depends on the volume and distribution of data. If the desired date is still in the last bucket, it will be underestimated even after increasing the histogram resolution.

⚖️ Tradeoff: Increasing statistics targets means ANALYZE will take longer to run because it must collect more data samples.
📝 Running a multi-tenant Postgres? Note that statistics are collected per column, not per tenant, so if different tenants have very different data distributions, the same kind of misestimation can appear.


Case 2: The Correlation Trap

This is perhaps the most common example. Consider an items table where a user has many items, but every user belongs to exactly one organization, and the user_id is unique across all organizations.

The following query will suffer from bad row estimation:

SELECT * 
FROM items 
WHERE organization_id = 1
  AND user_id = '2b56d251-4498-4be2-8390-395ffeed3ef3';

--- expected 21
--- actual 3,650

The result? The planner estimated 21 rows.
The reality? 3,650 rows found.

The Issue

Postgres calculates probabilities assuming columns are independent.

  • Probability of user_id '2b56d251...' is x
  • Probability of organization_id 1 is y

The planner calculates the combined probability as x*y.

However, these columns are highly correlated. If the user is '2b56d251...', they are definitely in Organization 1. The columns are not independent. The planner drastically underestimates the rows because it "double filters" the probability.
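
To make the math concrete with rough, illustrative numbers: if user_id alone matches about 3,537 of the ~7.6M rows (selectivity ≈ 0.00046), and organization_id = 1 matches, say, 0.6% of the table, the planner multiplies the two: 7,649,448 × 0.00046 × 0.006 ≈ 21 rows. In reality the two conditions overlap completely, so the true count stays around 3,650.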

The Solution

  1. Drop the redundant filter: If you don't strictly need to filter by organization, remove it.

    SELECT * 
    FROM items
    WHERE user_id = '2b56d251-4498-4be2-8390-395ffeed3ef3';
    
    --- expected 3,537
    --- actual 3,650
    
  2. Use Extended Statistics: If you must filter by organization (e.g., for security reasons), you need Extended Statistics. You can explicitly tell Postgres to calculate the correlation between these two columns.

    CREATE STATISTICS items_user_org (dependencies) ON user_id, organization_id FROM items;
    ANALYZE items;
    

    You can verify it was created using:

    SELECT statistics_name, dependencies
    FROM pg_stats_ext
    WHERE statistics_name = 'items_user_org';
    

    With this statistic object in place, Postgres understands that user_id implies organization_id and will correct the row estimation.

⚖️ Tradeoff: As in the previous case, collecting these extra statistics adds time to the ANALYZE process.


Case 3: The Function Blind Spot

Developers frequently transform data directly within the WHERE clause. A common pattern involves COALESCE.

Suppose you need to retrieve items created or updated after a specific date. Using COALESCE seems like elegant syntactic sugar:

SELECT * 
FROM items
WHERE COALESCE(updated_at, created_at) > '2026-02-01';

[Plan visualization: the COALESCE query with a massive overestimate]

The result? The planner estimated 2,549,816 rows.
The reality? Zero rows found.

The Issue

Postgres stores statistics for columns, not for function results. When you wrap a column in a function like COALESCE, the planner loses access to the underlying data distribution (histograms and Most Common Values).

Blinded, the planner falls back to a hard-coded default selectivity. Selectivity is the percentage of rows the planner expects to match the condition. For range inequalities (like >), Postgres blindly assumes the query will match 33.3% (1/3) of the table.

We can confirm this by checking the planner's estimate for the entire table:

EXPLAIN SELECT * FROM items;
-- estimated rows: 7,649,448

The math is: 7,649,448 / 3 = 2,549,816.
Unless your data distribution happens to match this arbitrary 1-in-3 guess, your row estimation will be wrong, regardless of how often you run ANALYZE.

The Solution

Rewrite the query to expose the raw columns. While slightly more verbose, this allows the planner to see the actual data. (Pro tip: Document why you wrote it this way, so a helpful teammate doesn't refactor it back to the "cleaner" COALESCE version).

SELECT * 
FROM items
WHERE 
    -- Avoid COALESCE to enable statistics usage
    updated_at > '2026-02-01'
    OR 
    (updated_at IS NULL AND created_at > '2026-02-01');

With the columns exposed, the planner generates a correct estimate:

[Plan visualization: the rewritten query with an accurate estimate]

Bonus: Index Usage

Rewriting the query provides a secondary benefit: it allows the database to use existing indexes on updated_at or created_at.

In Postgres 14+, you can also use Extended Statistics on Expressions, similar to what we did with the correlated columns. While this is a powerful option, for one-off queries, rewriting the SQL is often faster than altering the database schema definition and doesn't increase ANALYZE overhead.
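
If you do go the statistics route, a minimal sketch (Postgres 14+; the statistics object name is arbitrary) could look like this:

CREATE STATISTICS items_coalesce_ts ON (COALESCE(updated_at, created_at)) FROM items;
ANALYZE items;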


When should I run ANALYZE?

In this post, we showed a few examples where running ANALYZE won't help. So when should you use it? By default, autovacuum takes care of running ANALYZE periodically and updating the statistics. However, if your data changes drastically or rapidly, consider running ANALYZE manually to refresh the stats as soon as possible.
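
You can check when a table's statistics were last refreshed, either manually or by autovacuum:

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'items';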


Summary

Bad row estimation is a silent killer of query performance. While ANALYZE is the first line of defense, it isn't a magic wand.

When you see a plan that doesn't make sense:

  1. Check for functions hiding columns in your WHERE clauses.
  2. Look for outliers that might be skewing your histogram bounds.
  3. Identify correlated columns that are tricking the planner's independence assumption.

Understanding how Postgres views your data allows you to write better queries and maintain a healthier database.


Transparency Note: While authoring this article I used Gemini AI for proofreading, rephrasing certain sections and generating image charts. All content and specifically the SQL scenarios are based on actual production challenges.
