JoongHyuk Shin

Posted on Jun 21

1.4.8 Statistics: pg_statistic and Selectivity

#postgres #database #internals #planner

From 1.4.2 through 1.4.5, one number showed up every time the planner computed a cost. Weighing a sequential scan against an index scan, deciding which join method is cheaper, gauging intermediate result sizes while reordering joins: underneath all of it sits an estimate of "how many rows pass this condition." That number is called cardinality.

So how does the planner know it? It cannot actually count how many of 100,000 rows WHERE country = 'KR' keeps, not at planning time. Scanning the whole table just to build a plan would already be running the query once. So the planner consults statistics it prepared in advance. Where those statistics live, in what shape, and how cardinality comes out of them is the starting point of the whole cost calculation. If the start is off, every cost and join order downstream is off with it.

Selectivity Is the Surviving Fraction

The unit the planner works in is not an absolute row count but a fraction. Selectivity, the fraction of all rows that a WHERE clause keeps (between 0 and 1), is that unit. A selectivity of 0.5 means half the rows pass the condition; 0.001 means one row in a thousand survives.

Cardinality follows directly:

estimated rows = selectivity × input rows

The rows=... that EXPLAIN prints is exactly the result of this multiplication. On a 100,000-row table, a condition with selectivity 0.5 shows up as rows=50000. Every cost calculation the planner does ultimately takes this estimated row count as input. One wrong selectivity throws off the whole cost comparison.

That leaves one question. What does the planner look at to estimate selectivity?

The Planner's Source: pg_statistic

The answer is pg_statistic, the system catalog that holds per-column statistics. Statistics for each column of a table are stored here. The catalog itself is locked away from ordinary users, so you look at it through pg_stats, a view that exposes the statistics in human-readable form.

The statistics stored for one column fall into two kinds. One is a set of fixed fields present regardless of column type; the other is a set of slots filled according to the shape of the data. The fixed fields hold the fraction of NULLs, the number of distinct values, and the average byte width. The slots hold statistics of differing kinds, such as a list of common values or a histogram of the value distribution. The statistics for one column look like this:

pg_statistic (one column)
├─ fixed fields (every column)
│    null_frac     fraction of NULL rows
│    n_distinct    number of distinct values
│    avg_width     average byte width of a value
└─ slots (filled by data shape)
     MCV           common values and their frequencies
     histogram     boundaries that split the distribution
     correlation   value order vs physical storage order

Why a slot structure? Because different data types call for different statistics. A scalar value like an integer or a date benefits from a distribution histogram, but an array column needs something else, like its "most common elements." So PostgreSQL stores, in each slot, a kind code marking what sort of statistic lives there, and code reading the statistics searches by kind code rather than by slot number.

The Four Statistics Behind Selectivity Estimation

Four kinds of statistics do the actual work in selectivity estimation. Real data makes it click. I set up a 100,000-row table stat_demo. The country column has a skewed distribution (KR is half, US a fifth, JP a tenth, and the remaining 20% is spread over a few hundred rarely-used two-letter country codes); amount is a value spread evenly from 1 to 100,000; and note is half NULL. After running the statistics-collection command ANALYZE once, querying the country column's statistics from pg_stats gives this:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats WHERE tablename = 'stat_demo' AND attname = 'country';

null_frac         | 0
n_distinct        | 598
most_common_vals  | {KR, US, JP, FO, ...}
most_common_freqs | {0.5032, 0.1982, 0.1013, 0.00067, ...}

The four statistics divide up the work like this.

First, null_frac is the fraction of rows that are NULL. country has no NULLs, so it is 0; but the note column, half of which is NULL, would read about 0.5. The selectivity of WHERE note IS NULL is exactly that value, which on 100,000 rows estimates about 50,000 rows.

Second, n_distinct is how many distinct values there are. If an order-status column held only the three values 'paid', 'shipping', 'delivered', this would be 3. For country it came out as 598. This value serves as the denominator when estimating a value that is not common in an equality condition, because if values are spread evenly, one value takes up roughly 1/n_distinct.

This value is read differently depending on its sign. When positive, it is the count of distinct values, as just described. When negative, it means a multiple of the row count. Querying the amount column gives -0.53346, which means the number of distinct values is 0.53346 times the row count, that is, about 53,000 on 100,000 rows. The reason for also allowing a negative form is that a table's row count can be updated more often than the statistics are. Pinning the count as an absolute number would make it drift as rows grow, but storing it as a ratio lets the estimate move with the row count.

Third is most_common_vals (the most common values, MCV for short). For country the MCV came out as {KR, US, JP, FO, ...}. The leading KR, US, JP are the genuinely common values, taking up half, a fifth, and a tenth of the data. The FO at the end is different. It is one of the rarely-used codes that fill the remaining 20%; those rare codes all have a similar frequency (about 30 rows each), but in drawing the random sample FO happened to be picked most often among them and slipped onto the end of the MCV list. The fraction each value takes up sits right beside it in most_common_freqs, in the same order. KR is 0.5032, US is 0.1982. An equality condition on a common value uses this frequency directly.

Fourth, the histogram is a list of boundary values that split the value distribution into buckets. It does not appear on a categorical value like country; it appears on a column you can line up by magnitude, like amount. Querying amount gives these boundaries in histogram_bounds:

{7, 1049, 2049, 3114, 4167, 5139, ... , 99008, 99995}

With 101 boundaries there are 100 buckets between them. ANALYZE picks the boundaries so each bucket holds 1% of the rows (1,000 rows here). Because amount is spread evenly, the boundaries land at roughly 1,000 apart. A range condition like amount > 90000 follows this list of boundaries.

There is a reason MCV and the histogram are kept apart. A few values that are skewed and frequent behave differently from the rest, which are scattered at similar frequencies. The few frequent values are best remembered one frequency at a time, while the rest are best lumped into buckets. So a column can carry both an MCV and a histogram, and in that case the histogram describes only the distribution left over after the MCV values are removed.

Equality: Common Values vs Rare Values

The function that computes the selectivity of an equality condition col = constant (var_eq_const) treats common values and rare values differently. Same equality, but the estimation path forks.

If the constant is in the MCV list, the story is simple. ANALYZE already measured that value's frequency, so it is used as is. For WHERE country = 'KR':

country='KR'   -> rows=50323   (MCV frequency 0.5032 × 100000)
country='US'   -> rows=19820   (MCV frequency 0.1982 × 100000)
country='JP'   -> rows=10133   (MCV frequency 0.1013 × 100000)

The estimated row count is exactly proportional to the frequency. The more common the value, the more directly the statistics remember its frequency, so the estimate is most trustworthy.

If the constant is not in the MCV, the estimate takes one more step. How common it is, the planner does not know; but it does know the value sits among "everything that is not a common value." So it subtracts the sum of the common values' frequencies and the NULL fraction from the whole to get "the fraction the rest occupies," and divides that by the number of distinct values left after the common ones. It assumes the remaining values are all spread evenly at similar frequencies. For instance, if the common values take up 80% of the table and the remaining 20% is split among 100 distinct rare values, one rare value's selectivity is taken as 0.2 ÷ 100 = 0.002.

country='AAA' -> rows=33   (a value not in the MCV)

Tracing where this 33 comes from shows the logic of the estimate directly. The frequencies in the MCV add up to about 0.807. Most of it is taken by the common values KR, US, JP (0.5032 + 0.1982 + 0.1013 = 0.8027), with the rest of the rare MCV values adding a little more. There are no NULLs, so the remaining fraction is 1 - 0.807 ≈ 0.193. The number of distinct values left after the common ones is 598 total minus the 9 in the MCV, which is 589. So the selectivity is 0.193 ÷ 589 ≈ 0.00033, and multiplied by 100,000 it comes to about 33. That matches the 33 EXPLAIN printed.

Here an inherent limit of statistics-based estimation surfaces. AAA is in fact a value with not a single row in this table. Yet the estimate is 33. The planner does not know the value really is absent. It only has the information "not in the MCV," so it treats a value that does not exist and a value that exists rarely exactly the same, both at 33. Statistics are a summary drawn from a sample, so for a value the summary never captured, the planner can only fill in with the even-spread assumption.

When there are no statistics at all (a fresh table that has never been analyzed, say), the estimate gets cruder still. The default selectivity for equality is hardwired at 0.005, that is, one in 200. This is the same as assuming there are about 200 distinct values. It is the last-resort guess for when there is nothing to go on.

Ranges Use the Histogram, ANDs Multiply

A range condition like col > value or col < value uses the histogram seen earlier. It finds which bucket the constant falls into, and converts how far into that bucket it sits into a fraction. Since amount's boundaries were about 1,000 apart, for amount > 90000 the value 90000 sits just below the eleventh boundary from the end, leaving about 10 buckets above it. That is roughly the top 10%.

amount > 90000              -> rows=10007   (top ~10%)
amount BETWEEN 25000 AND 75000 -> rows=50071 (middle ~50%)

The condition asking for the top 10% estimates about 10,000 rows, and the one asking for the middle half about 50,000. With an even distribution, the fraction and the row count line up naturally.

When several conditions are joined with AND, the planner multiplies each one's selectivity. Putting country = 'KR' (0.5032) and amount > 90000 (about 0.10) together gives 0.5032 × 0.10 ≈ 0.050, about 5,000 rows.

country='KR' AND amount>90000 -> rows=5036   (actual 4938)

The estimate and reality nearly agree. But this multiplication hides an assumption: that the two conditions are independent. The example above was accurate because country and amount were built to be genuinely unrelated in the synthetic data. In real data, two columns are often entangled, and that is where this multiplication breaks down.

When Columns Are Correlated: Extended Statistics

Think of city and zip code. Once the zip code is fixed, the city is effectively determined. Put a condition like city = 'Seoul' AND zipcode = '06236', and the planner simply multiplies the two selectivities. But almost every row with zip code 06236 is already in Seoul, so city = 'Seoul' filters out essentially nothing beyond what the zip code condition already removed. The multiplication, treating the two as unrelated, shaves the fraction twice anyway. The result is a row count far below reality, an underestimate.

Every statistic seen so far is about a single column, so it cannot catch this entanglement. So PostgreSQL separately offers extended statistics, which collect the correlation between several columns as a group. With CREATE STATISTICS you name a group of columns, and ANALYZE measures that group's functional dependencies or its group-wise distinct count. The planner then knows city and zipcode are effectively one lump and applies the fraction only once.

CREATE STATISTICS addr_stat (dependencies) ON city, zipcode FROM addresses;
ANALYZE addresses;

Extended statistics objects have one interesting property. Unlike an index, they are designed as independent objects, not tied to a particular table.

This independence also shows up in ownership. An index, no matter who creates it, takes its owner from the table's owner; but the owner of a statistics object made with CREATE STATISTICS is whoever ran the CREATE statement. I saw this difference as an inconsistency, a bug, and proposed a patch to the PostgreSQL community to make a statistics object's owner follow the table's owner like an index does. My reasoning was this: unlike a view, which can be defined over several tables, statistics are built for a single table only, so like an index or a trigger they are strictly bound to that table, and ownership should follow the table. The proposal was not accepted. A statistics object having a different ownership model from an index is by design, and a PG core contributor pointed out the reason. There is a plan to eventually allow statistics defined across several tables, so they were deliberately made into independent objects not bound to any one table.

ANALYZE: Statistics Come from a Sample

Every statistic seen so far is filled in by ANALYZE. But ANALYZE does not read the whole table. It draws a random sample (some rows pulled at random from the table) and estimates the statistics from there.

The sample size is, by default, about 30,000 rows. Precisely, it is 300 times the setting that controls statistics precision (default_statistics_target, default 100). What is interesting is that this size is almost independent of the total table size. Whether the table has 10,000 rows or a billion, the sample is about 30,000 rows either way.

This can seem to clash with intuition. Shouldn't a bigger table call for looking at more? An opinion poll lines the intuition up. A national poll asks 1,000 to 2,000 people and produces a result with similar error whether the electorate is 10 million or 40 million. It does not scale the sample up in proportion to the population. The reliability of a sample is set by the size of the sample itself, not by what percentage of the whole it is. ANALYZE is the same. PostgreSQL's sample-size formula follows a 1998 paper (Chaudhuri, Motwani, Narasayya) on how large a sample histogram construction needs, and its conclusion too is that the sample size needed for a given precision barely grows no matter how large the table gets. So rather than growing the sample with the row count, a fixed absolute size gives enough precision.

That the statistics come from a sample also means they are always an approximation. And they are a snapshot from the moment ANALYZE ran. If the table changes a lot afterward, the statistics go stale. How stale statistics wreck an estimate is clearest seen directly. Starting from 100,000 rows (KR is half) and running ANALYZE, I then inserted 100,000 more KR rows and, without running ANALYZE again, asked the same condition:

After INSERT (before ANALYZE): country='KR' estimate = 99,573   (actual 150,030)
After ANALYZE                : country='KR' estimate = 150,060   (actual 150,030)

Right after the INSERT, the planner does roughly sense that the table grew and raises the total row count to near 200,000. But KR's frequency is still stale at 0.50. The real KR share is now 75% (150,030 / 200,000), yet multiplying by 0.50 underestimates it at about 100,000 rows. Only after ANALYZE remeasures the frequency does an estimate near 150,000 come out. The crux is that selectivity is a fraction: when the fraction is stale, the estimate misses even when the total row count is right.

Finally, back to why pg_statistic is locked away from ordinary users. The MCV slot holds actual data values (the most common country codes, the most common salary figures, and so on). The statistics themselves expose part of the data, so an unprivileged user reading the raw catalog leaks information. So the raw pg_statistic is blocked, and the human-facing pg_stats view is filtered to show only statistics for columns the reader is allowed to see.

What this means in practice

First, after a big data change, running ANALYZE yourself is the safe move. Right after a bulk load, a bulk delete, or a large update, the statistics are stale, and as the example above showed, selectivity being a fraction means the estimate misses even when the total row count is right. Autovacuum will eventually run ANALYZE, but there is a lag until it crosses its threshold. If you have to run a heavy query right after loading, slipping a manual ANALYZE in between is the better choice. When a query suddenly turns slow, checking whether the relevant column's statistics in pg_stats match reality is the starting point of diagnosis.

Second, a query whose estimate is badly off shows its cause when you line up EXPLAIN ANALYZE's estimated and actual row counts. If it says rows=33 but 50,000 rows actually came out, that node's selectivity estimate is wrong. And the error does not stop there. When one table's estimate is off, the join sitting on top of it is off by more, and it amplifies as join stages stack up. So finding the node where estimate and actual diverge by more than tenfold usually points at that column's statistics or a correlation between conditions, and the starting point of a slow query is often a single column with skewed statistics.

Third, when estimates are inaccurate on a very large table, you can raise statistics precision per column. With the sample fixed at about 30,000 rows, a large table with very many distinct values or an awkward distribution may have a sample that cannot capture the whole distribution. Raising that column's statistics target makes the sample size and the MCV and histogram sizes grow together, sharpening the estimate. For example, to raise country's target from the default 100 to 1000:

ALTER TABLE stat_demo ALTER COLUMN country SET STATISTICS 1000;
ANALYZE stat_demo;

You have to run ANALYZE again after changing the setting for the new target to take effect in the statistics. But ANALYZE cost and planning time grow with it, so it is right to apply this selectively, only to columns where the estimate is actually a problem.