<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michal Cyncynatus</title>
    <description>The latest articles on DEV Community by Michal Cyncynatus (@michal_cyncynatus_3a792c2).</description>
    <link>https://dev.to/michal_cyncynatus_3a792c2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3643367%2F5e03cbcc-64e0-4869-ae26-446b52537f0e.jpg</url>
      <title>DEV Community: Michal Cyncynatus</title>
      <link>https://dev.to/michal_cyncynatus_3a792c2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michal_cyncynatus_3a792c2"/>
    <language>en</language>
    <item>
      <title>When ANALYZE Isn't Enough: Debugging Bad Row Estimation in PostgreSQL</title>
      <dc:creator>Michal Cyncynatus</dc:creator>
      <pubDate>Tue, 23 Dec 2025 08:18:19 +0000</pubDate>
      <link>https://dev.to/michal_cyncynatus_3a792c2/when-analyze-isnt-enough-debugging-bad-row-estimation-in-postgresql-47n6</link>
      <guid>https://dev.to/michal_cyncynatus_3a792c2/when-analyze-isnt-enough-debugging-bad-row-estimation-in-postgresql-47n6</guid>
      <description>&lt;p&gt;As backend developers, we’ve all been there: a query that usually runs instantly suddenly hangs, or a specific endpoint times out in production. You dive into the investigation, and the first step is almost always running the query with &lt;code&gt;EXPLAIN&lt;/code&gt; to see the execution plan.&lt;/p&gt;

&lt;p&gt;If you use a visualization tool (like &lt;a href="https://explain.dalibo.com/" rel="noopener noreferrer"&gt;Dalibo’s&lt;/a&gt; plan viewer or &lt;a href="https://explain.depesz.com/" rel="noopener noreferrer"&gt;Explain Depesz&lt;/a&gt;), you might spot a red flag staring back at you: &lt;strong&gt;"Bad Row Estimation."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider this query, which filters items for a specific organization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;organization_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;102&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dalibo visualizes the &lt;a href="https://explain.dalibo.com/plan/6fegb5g39e45eceg" rel="noopener noreferrer"&gt;plan&lt;/a&gt; and row estimation like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4qvolu41iukcb2mqrx4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4qvolu41iukcb2mqrx4.jpg" alt="1_over_estimate_by_org_id" width="588" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first glance, it looks good: it uses an index and appears to be a fast query. But looking closer, you can see the planner expected to find nearly 34K rows but found only 365. Why does this discrepancy exist, and why should you care?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Row Estimation Matters
&lt;/h2&gt;

&lt;p&gt;The Postgres Planner’s job is to calculate the cost of various execution paths and select the cheapest one. These calculations rely entirely on how many rows the planner expects to process.&lt;/p&gt;

&lt;p&gt;When row estimation is wrong, the planner optimizes for a scenario that doesn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How It Breaks: Join Strategies&lt;/strong&gt;&lt;br&gt;
Consider a query joining &lt;code&gt;items&lt;/code&gt; and &lt;code&gt;item_types&lt;/code&gt;. The planner must decide how to join these tables, usually choosing between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Nested Loop:&lt;/strong&gt; Extremely fast if one side of the join is small (e.g., 10 rows).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Hash Join:&lt;/strong&gt; More efficient for large datasets (e.g., 1,000,000 rows) but requires building a hash table in memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt;&lt;br&gt;
If the planner estimates a user has only &lt;strong&gt;10 items&lt;/strong&gt; (when they actually have &lt;strong&gt;1,000,000&lt;/strong&gt;), it will choose a &lt;strong&gt;Nested Loop&lt;/strong&gt;. Instead of a quick Hash Join, the database effectively loops a million times, turning a millisecond query into one that hangs for minutes.&lt;/p&gt;
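&lt;p&gt;To make the join flip concrete, here is a toy cost model in Python. The formulas and constants are invented for illustration; Postgres's real cost functions live in &lt;code&gt;costsize.c&lt;/code&gt; and are far more nuanced:&lt;/p&gt;

```python
# Toy cost model showing why a bad row estimate flips the join choice.
# The formulas and constants are simplified stand-ins, NOT Postgres's
# actual cost functions.

def nested_loop_cost(outer_rows, inner_lookup_cost=1.0):
    # One index probe into the inner table per outer row.
    return outer_rows * inner_lookup_cost

def hash_join_cost(outer_rows, inner_rows, per_row=0.01):
    # Build a hash table once, probe it once per outer row,
    # plus a flat startup overhead for the hash table itself.
    return inner_rows * per_row + outer_rows * per_row + 100

def cheapest(outer_rows, inner_rows):
    costs = {
        "Nested Loop": nested_loop_cost(outer_rows),
        "Hash Join": hash_join_cost(outer_rows, inner_rows),
    }
    return min(costs, key=costs.get)

# Planner believes the user has 10 items: Nested Loop wins.
print(cheapest(10, 50_000))         # Nested Loop
# The user actually has 1,000,000 items: Hash Join should have won.
print(cheapest(1_000_000, 50_000))  # Hash Join
```

&lt;p&gt;Swap the estimated outer-row count for the real one and the "cheapest" choice flips, which is exactly the trap described above.&lt;/p&gt;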

&lt;p&gt;Beyond joins, faulty estimation affects other planner decisions made prior to running the query, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Parallel Workers:&lt;/strong&gt; The planner may decide the dataset is too small to justify launching parallel workers, leaving CPU cores idle while a single core struggles.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Index usage:&lt;/strong&gt; It might choose to perform a sequential scan if it assumes there are many records, effectively ignoring a perfectly good index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bottom line: &lt;strong&gt;row estimation dictates planner decisions.&lt;/strong&gt; Bad row estimation causes bad decisions.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Standard Fix: ANALYZE
&lt;/h2&gt;

&lt;p&gt;When a developer sees "Bad Row Estimation," the immediate reaction (and the first-aid advice you’ll get from ChatGPT or StackOverflow) is simple: &lt;strong&gt;Run ANALYZE&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ANALYZE&lt;/code&gt; samples the table data and updates the statistics stored in &lt;code&gt;pg_statistic&lt;/code&gt;. It refreshes the planner's knowledge of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Null fraction:&lt;/strong&gt; How many rows are null?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Distinct values:&lt;/strong&gt; How many unique items are there?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Most Common Values (MCV):&lt;/strong&gt; Which values appear most often?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Histogram bounds:&lt;/strong&gt; How is the data distributed?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Usually, this works. You run &lt;code&gt;ANALYZE table_name&lt;/code&gt;, the stats update, the estimate becomes accurate, and the plan is fixed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what happens when you run &lt;code&gt;ANALYZE&lt;/code&gt;, and the estimation is still wrong?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where we leave the realm of basic administration and enter the internals of the Postgres Planner. Here are three common scenarios where &lt;code&gt;ANALYZE&lt;/code&gt; falls short, and how to fix them.&lt;/p&gt;


&lt;h2&gt;
  
  
  Case 1: The OUTLIERS (Data Skew)
&lt;/h2&gt;

&lt;p&gt;Imagine you have a table of &lt;code&gt;items&lt;/code&gt; with a &lt;code&gt;created_at&lt;/code&gt; timestamp. Most of your data is historical or current.&lt;br&gt;
Running this query to get all the items created on December 30, 2025, you get a stable and expected plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2025-12-30'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2025-12-31'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8nsm8kza6o2duitwszj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8nsm8kza6o2duitwszj.jpg" alt="4_data_skew_no_future_date" width="517" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, at some point, a developer writes a test script that inserts a single record with a date in the year 2099. Suddenly, you get this plan:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwpwil79pf6xxtqz65yo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwpwil79pf6xxtqz65yo.jpg" alt="5_data_skew_with_future_date" width="513" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result?&lt;/strong&gt; The planner estimated 4 rows.&lt;br&gt;
&lt;strong&gt;The reality?&lt;/strong&gt; 21K rows found.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Issue
&lt;/h3&gt;

&lt;p&gt;Postgres uses a histogram to model data distribution. It takes the minimum value and the maximum value, then divides them into buckets.&lt;/p&gt;

&lt;p&gt;Postgres creates histogram buckets that try to hold an equal amount of data. For simplicity, let's assume it divided the 2025 data into 12 buckets. Each month gets a bucket and holds 8.33% of the data.&lt;/p&gt;

&lt;p&gt;After adding the 2099 record, the range of the final bucket changes from &lt;code&gt;2025-12-01 .. 2025-12-31&lt;/code&gt; to &lt;code&gt;2025-12-01 .. 2099-01-01&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sodaywirxbkbtguxeon.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sodaywirxbkbtguxeon.jpg" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This single bucket now spans 74 years but contains the same amount of data (e.g., 8.33%) as the other buckets for other months of 2025. When you query for a date like '2025-12-30', it falls into this massive final bucket. The planner assumes data within a bucket is spread uniformly. It thinks that 8.33% of the data is sprinkled thinly across 74 years, rather than being clustered in December 2025. This causes the massive underestimation.&lt;/p&gt;
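&lt;p&gt;The arithmetic behind the underestimation can be sketched in a few lines of Python. The row count and bucket fraction below are invented; only the mechanism (uniform spread within a bucket) mirrors the planner:&lt;/p&gt;

```python
from datetime import date

# Back-of-envelope sketch of the "uniform spread within a bucket" assumption.
# total_rows and bucket_fraction are made-up numbers, not real statistics.

total_rows = 2_500_000
bucket_fraction = 1 / 12   # the final bucket holds ~8.33% of all rows

# Bucket width before the outlier: December 2025 only.
narrow_span = (date(2025, 12, 31) - date(2025, 12, 1)).days
# Bucket width after the outlier: December 2025 all the way to 2099.
wide_span = (date(2099, 1, 1) - date(2025, 12, 1)).days

query_days = 1  # WHERE created_at BETWEEN '2025-12-30' AND '2025-12-31'

def estimate(span_days):
    # rows in bucket * (queried slice / bucket width), assuming uniformity
    bucket_rows = total_rows * bucket_fraction
    return round(bucket_rows * query_days / span_days)

print(estimate(narrow_span))  # 6944 - a plausible one-day count
print(estimate(wide_span))    # 8 - the same bucket, smeared over 74 years
```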

&lt;p&gt;Now that we understand the root cause of the underestimation, let's see what we can do to fix it.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Clean your data:&lt;/strong&gt; Remove the outliers (the 2099 records) if they are invalid. This might sound obvious, but sometimes we just don't know we have these outliers. If the outliers come from user input, consider being more defensive and adding input validation to prevent them, provided it fits your business logic.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Increase Statistics Target:&lt;/strong&gt; You can tell Postgres to create more "buckets" for the histogram to get a higher-resolution view of the data.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;STATISTICS&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- Default is 100&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;You can verify the statistics target was updated from 100 to 1000 by running:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_distinct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;histogram_bounds&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'items'&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why does this work?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;At 100 buckets (Low Res):&lt;/strong&gt; Each bucket holds 1% of your data. The final bucket starts in late 2025 and ends at 2099. That outlier "poisons" the entire 1% chunk, causing Postgres to smear that data thinly across 74 years.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;At 1000 buckets (High Res):&lt;/strong&gt; Each bucket holds only 0.1% of your data. Now, the "bridge" to 2099 is contained entirely within the top 0.1% bucket. We poison less data, and the chance of getting a wrong estimation decreases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: This change won't always help; it depends on the volume and distribution of data. If the desired date is still in the last bucket, it will be underestimated even after increasing the histogram resolution.&lt;/p&gt;

&lt;p&gt;⚖️ &lt;strong&gt;Tradeoff:&lt;/strong&gt; Increasing the statistics target means &lt;code&gt;ANALYZE&lt;/code&gt; will take longer to run because it must collect more data samples.&lt;br&gt;
📝 Running a multi-tenant Postgres? Note that statistics are per column, so if different tenants have different data distributions, the same kind of estimation error can appear.&lt;/p&gt;


&lt;h2&gt;
  
  
  Case 2: The Correlation Trap
&lt;/h2&gt;

&lt;p&gt;This is perhaps the most common example. Consider an &lt;code&gt;items&lt;/code&gt; table where a user has many items, but every user belongs to exactly one organization, and the &lt;code&gt;user_id&lt;/code&gt; is unique across all organizations.&lt;/p&gt;

&lt;p&gt;The following query will suffer from bad row estimation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;organization_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2b56d251-4498-4be2-8390-395ffeed3ef3'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;--- expected 21&lt;/span&gt;
&lt;span class="c1"&gt;--- actual 3,650&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result?&lt;/strong&gt; The planner estimated 21 rows.&lt;br&gt;
&lt;strong&gt;The reality?&lt;/strong&gt; 3,650 rows found.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Issue
&lt;/h3&gt;

&lt;p&gt;Postgres calculates probabilities assuming columns are &lt;strong&gt;independent&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Probability of &lt;code&gt;user_id&lt;/code&gt; '2b56d251...' is &lt;em&gt;x&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Probability of &lt;code&gt;organization_id&lt;/code&gt; 1 is &lt;em&gt;y&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The planner calculates the combined probability as x*y.&lt;/p&gt;

&lt;p&gt;However, these columns are &lt;strong&gt;highly correlated&lt;/strong&gt;. If the user is '2b56d251...', they are &lt;em&gt;definitely&lt;/em&gt; in Organization 1. The columns are not independent. The planner drastically underestimates the rows because it "double filters" the probability.&lt;/p&gt;
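&lt;p&gt;Here is the independence assumption as back-of-envelope Python. The tenant sizes are invented, so the exact figures differ from the plan above, but the shape of the error is the same:&lt;/p&gt;

```python
# The independence assumption, as back-of-envelope arithmetic.
# Table and tenant sizes below are invented for illustration.

total_rows = 7_300_000
org_rows = 1_200_000   # rows where organization_id = 1
user_rows = 3_650      # rows for our user, all of them inside organization 1

sel_org = org_rows / total_rows    # selectivity of the organization filter
sel_user = user_rows / total_rows  # selectivity of the user filter

# Planner: columns are independent, so multiply the selectivities.
independent_estimate = round(total_rows * sel_org * sel_user)

# Reality: user_id implies organization_id, so the user filter alone decides.
actual = user_rows

print(independent_estimate)  # 600  - the "double filtered" underestimate
print(actual)                # 3650 - what the scan actually returns
```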
&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Drop the redundant filter:&lt;/strong&gt; If you don't strictly need to filter by organization, remove it.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2b56d251-4498-4be2-8390-395ffeed3ef3'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;--- expected 3,537&lt;/span&gt;
&lt;span class="c1"&gt;--- actual 3,650&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Use Extended Statistics:&lt;/strong&gt; If you must filter by organization (e.g., for security reasons), you need &lt;strong&gt;Extended Statistics&lt;/strong&gt;. You can explicitly tell Postgres to calculate the correlation between these two columns.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;STATISTICS&lt;/span&gt; &lt;span class="n"&gt;items_user_org&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;organization_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;You can verify it was created using:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats_ext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;With this statistic object in place, Postgres understands that &lt;code&gt;user_id&lt;/code&gt; implies &lt;code&gt;organization_id&lt;/code&gt; and will correct the row estimation.&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;⚖️ &lt;strong&gt;Tradeoff:&lt;/strong&gt; As in the previous case, collecting these extra statistics adds time to the &lt;code&gt;ANALYZE&lt;/code&gt; process.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case 3: The Function Blind Spot
&lt;/h2&gt;

&lt;p&gt;Developers frequently transform data directly within the &lt;code&gt;WHERE&lt;/code&gt; clause. A common pattern involves &lt;code&gt;COALESCE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Suppose you need to retrieve items &lt;strong&gt;created or updated&lt;/strong&gt; after a specific date. Using &lt;code&gt;COALESCE&lt;/code&gt; seems like elegant syntactic sugar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz8m9bgk4hrvkrowdcar.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz8m9bgk4hrvkrowdcar.jpg" alt="2_coalesce_over_estimate" width="512" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result?&lt;/strong&gt; The planner estimated 2,549,816 rows.&lt;br&gt;
&lt;strong&gt;The reality?&lt;/strong&gt; Zero rows found.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Issue
&lt;/h3&gt;

&lt;p&gt;Postgres stores statistics for &lt;em&gt;columns&lt;/em&gt;, not for function results. When you wrap a column in a function like &lt;code&gt;COALESCE&lt;/code&gt;, the planner loses access to the underlying data distribution (histograms and Most Common Values).&lt;/p&gt;

&lt;p&gt;Blinded, the planner falls back to a hard-coded &lt;strong&gt;default selectivity&lt;/strong&gt;. Selectivity is the percentage of rows the planner expects to match the condition. For range inequalities (like &lt;code&gt;&amp;gt;&lt;/code&gt;), Postgres blindly assumes the query will match &lt;strong&gt;33.3% (1/3)&lt;/strong&gt; of the table.&lt;/p&gt;

&lt;p&gt;We can show this is happening by looking at the total row count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Expected:  7,649,448&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The math is: &lt;code&gt;7,649,448 / 3 = 2,549,816&lt;/code&gt;.&lt;br&gt;
Unless your data distribution happens to match this arbitrary 1-in-3 guess, your row estimation will be wrong, regardless of how often you run &lt;code&gt;ANALYZE&lt;/code&gt;.&lt;/p&gt;
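&lt;p&gt;You can reproduce the planner's guess with one line of arithmetic; the constant &lt;code&gt;DEFAULT_INEQ_SEL&lt;/code&gt; really is hard-coded to 1/3 in the Postgres source (&lt;code&gt;selfuncs.h&lt;/code&gt;):&lt;/p&gt;

```python
# Reproducing the planner's fallback guess for an opaque range predicate.
# DEFAULT_INEQ_SEL is a real Postgres constant; the table size is the one
# from the plan above.

DEFAULT_INEQ_SEL = 1 / 3
table_rows = 7_649_448

estimate = round(table_rows * DEFAULT_INEQ_SEL)
print(estimate)  # 2549816 - exactly the bogus row estimate in the plan
```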
&lt;h3&gt;
  
  
  The Solution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rewrite the query&lt;/strong&gt; to expose the raw columns. While slightly more verbose, this allows the planner to see the actual data. (Pro tip: Document &lt;em&gt;why&lt;/em&gt; you wrote it this way, so a helpful teammate doesn't refactor it back to the "cleaner" &lt;code&gt;COALESCE&lt;/code&gt; version).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; 
    &lt;span class="c1"&gt;-- Avoid COALESCE to enable statistics usage&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt;
    &lt;span class="k"&gt;OR&lt;/span&gt; 
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-02-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the columns exposed, the planner generates a correct estimate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v4f4191vvlijb2mjk30.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v4f4191vvlijb2mjk30.jpg" alt="3_coalesce_fix" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Bonus: Index Usage
&lt;/h4&gt;

&lt;p&gt;Rewriting the query provides a secondary benefit: it allows the database to use existing indexes on &lt;code&gt;updated_at&lt;/code&gt; or &lt;code&gt;created_at&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Postgres 14+, you can also use &lt;strong&gt;Extended Statistics on Expressions&lt;/strong&gt; similar to what we did with the correlated columns. While this is a powerful option, for one-off queries, rewriting the SQL is often faster than altering the database schema definition and doesn't increase &lt;code&gt;ANALYZE&lt;/code&gt; overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  When should I run ANALYZE?
&lt;/h2&gt;

&lt;p&gt;In this post, we showed a few examples of when running &lt;code&gt;ANALYZE&lt;/code&gt; won't help. So, when should we use &lt;code&gt;ANALYZE&lt;/code&gt;? By default, &lt;code&gt;autovacuum&lt;/code&gt; runs &lt;code&gt;ANALYZE&lt;/code&gt; periodically and keeps the statistics fresh. However, if your data changes drastically or rapidly, run &lt;code&gt;ANALYZE&lt;/code&gt; manually to update the stats as soon as possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Bad row estimation is a silent killer of query performance. While &lt;code&gt;ANALYZE&lt;/code&gt; is the first line of defense, it isn't a magic wand.&lt;/p&gt;

&lt;p&gt;When you see a plan that doesn't make sense:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Check for &lt;strong&gt;functions&lt;/strong&gt; hiding columns in your &lt;code&gt;WHERE&lt;/code&gt; clauses.&lt;/li&gt;
&lt;li&gt; Look for &lt;strong&gt;outliers&lt;/strong&gt; that might be skewing your histogram bounds.&lt;/li&gt;
&lt;li&gt; Identify &lt;strong&gt;correlated columns&lt;/strong&gt; that are tricking the planner's independence assumption.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Understanding how Postgres views your data allows you to write better queries and maintain a healthier database.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.postgresql.org/docs/current/planner-stats.html" rel="noopener noreferrer"&gt;PostgreSQL Documentation: Row Estimation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.postgresql.org/docs/current/view-pg-stats.html" rel="noopener noreferrer"&gt;PostgreSQL Documentation: pg_stats view&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://gist.github.com/michal-cyn/dde9d74fb222dfeb6aaa3fcca9966e8b" rel="noopener noreferrer"&gt;All code snippets and reproducible sandbox&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Transparency Note:&lt;/em&gt; While authoring this article I used Gemini AI for proofreading, rephrasing certain sections and generating image charts. All content and specifically the SQL scenarios are based on actual production challenges.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sql</category>
      <category>database</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
