Franck Pachot for YugabyteDB

Extended Statistics and pg_hint_plan /*+ Rows() */

By default, the cost-based optimizer estimates predicate selectivity from per-column statistics such as the number of distinct values. When filters on multiple columns are combined, the query planner assumes no correlation between the columns and simply multiplies their selectivities. This can lead to bad execution plans.
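
As a worked illustration with hypothetical numbers: with one million rows and two 1% predicates that always match together, the planner estimates 1,000,000 × 0.01 × 0.01 = 100 rows, while the actual result is 1,000,000 × 0.01 = 10,000 rows.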

There are two ways to give more information to the PostgreSQL query planner:

  • correlated columns within a single table: extended statistics so that the number of distinct values is also gathered for column groups
  • correlated columns in joined tables: optimizer hints to adjust the cardinality estimations

TL;DR: you need both to give the query planner all the information it needs.

An example in PostgreSQL

In a PostgreSQL lab, I load a table with cities and their countries:

postgres=# create table countries_cities (
  id bigint primary key, name text not null,
  state_id   int   not null  ,
  state_code   text not null,  state_name text not null,
  country_id int   not null  ,
   country_code text not null,  country_name text not null,
  latitude   float not null, longitude      float not null,
  wikidataid text default null
);

CREATE TABLE

postgres=# \! wget -qc -O /var/tmp/countries_cities.csv https://github.com/dr5hn/countries-states-cities-database/raw/master/csv/cities.csv

postgres=# \copy countries_cities from '/var/tmp/countries_cities.csv' with csv header

COPY 150573

postgres=# vacuum analyze countries_cities;

VACUUM


I plan to query the cities in California and compare the estimated cardinality (cost... rows=) to the actual one (actual... rows=) using explain analyze. This will help us ensure the estimations are accurate for more complex queries. If the estimates are incorrect, the query planner may choose an inefficient execution plan when the result is joined with other tables.

With a single-column predicate (state_code='CA'), the estimation is reasonably accurate: 1074 rows estimated for 1233 actual ones:

postgres=# explain (analyze, summary off)
 select * from countries_cities
 where state_code='CA';

                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on countries_cities  (cost=0.00..4154.16 rows=1074 width=77) (actual time=1.628..14.918 rows=1233 loops=1)
   Filter: (state_code = 'CA'::text)
   Rows Removed by Filter: 149340


Now, because there may be a 'CA' state code in other countries, I also filter on the country code (state_code='CA' and country_code='US'), but the optimizer under-estimates the number of rows: 139 instead of the 1124 actual ones:

postgres=# explain (analyze, summary off)
 select * from countries_cities
 where state_code='CA' and country_code='US';

                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on countries_cities  (cost=0.00..4530.59 rows=139 width=77) (actual time=15.577..18.132 rows=1124 loops=1)
   Filter: ((state_code = 'CA'::text) AND (country_code = 'US'::text))
   Rows Removed by Filter: 149449

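Where does 139 come from? A back-of-the-envelope check with the numbers above (the 13% frequency for 'US' is back-computed from the plan, not read from pg_stats):

-- independence assumption applied by the planner:
--   sel(state_code='CA')   ≈ 1074 / 150573 ≈ 0.0071
--   sel(country_code='US') ≈ 139 / 1074    ≈ 0.13
--   estimate ≈ 150573 × 0.0071 × 0.13 ≈ 139 rows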

The reason is that no optimizer statistics account for the correlation between the country code and the state code: for most cities in this table, the 'CA' state code is in the 'US':

postgres=#  select count(*), country_code from countries_cities where state_code='CA' group by country_code order by 1;

 count | country_code
-------+--------------
     1 | CV
     1 | BI
     2 | SV
     2 | NA
     2 | MD
     2 | GW
     9 | NI
    17 | LU
    29 | UY
    44 | ES
  1124 | US


You can add redundant predicates, like state_code='CA' and country_code='US' and state_name='California' and country_name='United States', and you will quickly see the under-estimation drop to rows=1 even though there are still 1124 rows. This will generate very bad execution plans for more complex queries.
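
For example, this is the kind of fully redundant query where the plan's estimate drops to rows=1 (same table and predicates as above, so the actual count is still 1124):

explain (analyze, summary off)
 select * from countries_cities
 where state_code='CA' and country_code='US'
   and state_name='California' and country_name='United States';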

We need to give more information to the query planner so that the optimizer knows about the correlation between those columns.

Single Table: Extended Statistics

When the correlation is in a single table, we can gather extended statistics with CREATE STATISTICS and ANALYZE:

postgres=# create statistics counties_cities_distinct (ndistinct)
 on state_code  , country_code, state_name, country_name
 from countries_cities;

CREATE STATISTICS

postgres=# analyse countries_cities;
ANALYZE


Those are visible in pg_stats_ext:

postgres=# \x

postgres=# select statistics_name, attnames, exprs, kinds, n_distinct
 from pg_stats_ext where tablename='countries_cities' ;

-[ RECORD 1 ]---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
statistics_name | counties_cities_distinct
attnames        | {state_code,state_name,country_code,country_name}
exprs           |
kinds           | {d}
n_distinct      | {"4, 5": 2388, "4, 7": 2395, "4, 8": 2395, "5, 7": 2394, "5, 8": 2394, "7, 8": 190, "4, 5, 7": 2395, "4, 5, 8": 2395, "4, 7, 8": 2395, "5, 7, 8": 2394, "4, 5, 7, 8": 2395}

postgres=> select attnum, attname from pg_attribute
 where attrelid='countries_cities'::regclass 
 and attname in ('state_code','country_code','state_name','country_name')
 order by 1;

 attnum |   attname
--------+--------------
      4 | state_code
      5 | state_name
      7 | country_code
      8 | country_name
(4 rows)


This tells the optimizer that there are, approximately (the numbers come from a sample):

  • 190 distinct values for the combination of (country_code, country_name), which is the number of distinct countries
  • 2388 distinct values for the combination of (state_code, state_name), which is the number of distinct states
  • 2394 or 2395 distinct values when combining attributes from states and countries, accounting for the same state names appearing in multiple countries
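
Note that the ndistinct statistics kind is mainly used for GROUP BY estimates. For correlated equality predicates in a WHERE clause, PostgreSQL also provides the dependencies and mcv statistics kinds; a minimal sketch on the same column group (the statistics name is my own choice):

create statistics countries_cities_dep (dependencies, mcv)
 on state_code, country_code, state_name, country_name
 from countries_cities;

analyze countries_cities;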

Normalization to two tables

Unless it is a data warehouse dimension, you will rarely find all of this in a single table. Let's normalize it into two tables:

postgres=# create table countries as
 select distinct country_id, country_code, country_name from countries_cities;

SELECT 198

postgres=# create statistics countries_distinct (ndistinct) 
 on country_code, country_name from countries;

CREATE STATISTICS

postgres=# create table cities as
  select id, name, country_id, state_code, state_name, latitude, longitude, wikidataid from countries_cities;

SELECT 150573

postgres=# create statistics cities_state_distinct (ndistinct) 
 on state_code, state_name from cities;

CREATE STATISTICS

postgres=# alter table countries add primary key (country_id);

ALTER TABLE

postgres=# alter table cities add primary key (id);

ALTER TABLE

postgres=# alter table cities add foreign key (country_id) references countries(country_id);

ALTER TABLE

postgres=# analyze countries, cities;

ANALYZE


I can query those tables with a join. With a single-column predicate, the estimation (rows=1144) is close to the actual result (rows=1233):

postgres=# explain (analyze, summary off)
 select * from countries join cities using(country_id)
 where state_code='CA';

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=6.46..3555.69 rows=1144 width=62) (actual time=1.675..15.134 rows=1233 loops=1)
   Hash Cond: (cities.country_id = countries.country_id)
   ->  Seq Scan on cities  (cost=0.00..3546.16 rows=1144 width=50) (actual time=1.620..14.797 rows=1233 loops=1)
         Filter: (state_code = 'CA'::text)
         Rows Removed by Filter: 149340
   ->  Hash  (cost=3.98..3.98 rows=198 width=16) (actual time=0.048..0.048 rows=198 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 18kB
         ->  Seq Scan on countries  (cost=0.00..3.98 rows=198 width=16) (actual time=0.007..0.020 rows=198 loops=1)


However, with a predicate on each table, the query planner under-estimates the cardinality: 6 rows instead of 1124:

postgres=# explain (analyze, summary off)
 select * from countries join cities using(country_id)
 where state_code='CA' and country_code='US';

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..3564.94 rows=6 width=62) (actual time=12.466..14.660 rows=1124 loops=1)
   Join Filter: (countries.country_id = cities.country_id)
   Rows Removed by Join Filter: 109
   ->  Seq Scan on countries  (cost=0.00..4.47 rows=1 width=16) (actual time=0.018..0.028 rows=1 loops=1)
         Filter: (country_code = 'US'::text)
         Rows Removed by Filter: 197
   ->  Seq Scan on cities  (cost=0.00..3546.16 rows=1144 width=50) (actual time=1.537..14.464 rows=1233 loops=1)
         Filter: (state_code = 'CA'::text)
         Rows Removed by Filter: 149340


Can we create extended statistics? The syntax exists but is not implemented (I'm running this in PostgreSQL 16):

postgres=# create statistics x (ndistinct)  on state_code , country_code
 from countries join cities using(country_id);

ERROR:  0A000: only a single relation is allowed in CREATE STATISTICS
LOCATION:  ProcessUtilitySlow, utility.c:1885
Time: 0.781 ms


When the correlation is between two joined tables, we need something else, and this is where optimizer hints can solve the problem.

Across joins: Rows() Hint

There are no optimizer hints in core PostgreSQL, but we can use the pg_hint_plan extension. As it is installed by default in YugabyteDB (a PostgreSQL fork with horizontal scalability), I'm running the same example in my three-node lab after enabling the Cost Based Optimizer, which is not yet the default in YugabyteDB 2.19.3:

yugabyte=#  set yb_enable_optimizer_statistics=on;
SET

yugabyte=#  set yb_enable_base_scans_cost_model=on;
SET

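To try the same hints on vanilla PostgreSQL, the pg_hint_plan extension must be installed and loaded first; a minimal sketch:

-- load for the current session:
LOAD 'pg_hint_plan';
-- or for all sessions, in postgresql.conf:
-- shared_preload_libraries = 'pg_hint_plan'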

Here is the query with the optimizer misestimate (result estimated at 6 rows instead of 1124):

yugabyte=# explain (analyze, summary off)
 select * from countries join cities using(country_id)
 where state_code='CA' and country_code='US';

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=11.25..94944.27 rows=6 width=62) (actual time=29.371..30.045 rows=1124 loops=1)
   Join Filter: (countries.country_id = cities.country_id)
   Rows Removed by Join Filter: 109
   ->  Seq Scan on countries  (cost=3.83..148.58 rows=1 width=16) (actual time=1.190..1.192 rows=1 loops=1)
         Remote Filter: (country_code = 'US'::text)
   ->  Seq Scan on cities  (cost=7.42..94781.45 rows=1139 width=50) (actual time=28.173..28.521 rows=1233 loops=1)
         Remote Filter: (state_code = 'CA'::text)


I know where the misestimation comes from: the selectivity of state_code='CA' has been multiplied by the selectivity of country_code='US', as if each state code had cities in every country. What I know about this data is that a given state code usually belongs to a single country, so I can multiply the query planner's estimation by the number of countries, which is 198. The right factor is actually a bit less, as a few countries other than 'US' also have cities in a 'CA' state. Anyway, this is an approximation: since the previous test showed 190 distinct (country_code, country_name) combinations, I'll use that factor, hinting with Rows( countries cities *190 ):

yugabyte=# explain (analyze, summary off)
/*+ Rows( countries cities *190 ) */
select * from countries join cities using(country_id)
;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=133.25..79104.23 rows=28608870 width=62) (actual time=4.083..148.959 rows=150573 loops=1)
   Hash Cond: (cities.country_id = countries.country_id)
   ->  Seq Scan on cities  (cost=7.42..78574.69 rows=150573 width=50) (actual time=1.753..95.049 rows=150573 loops=1)
   ->  Hash  (cost=123.36..123.36 rows=198 width=16) (actual time=2.310..2.310 rows=198 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 18kB
         ->  Seq Scan on countries  (cost=3.83..123.36 rows=198 width=16) (actual time=0.663..2.209 rows=198 loops=1)

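The hint multiplied the join estimate by 190: on the unfiltered join, the estimate went from 150573 rows to 150573 × 190 = 28608870. Applied to the problematic filtered query, the same hint should scale the estimate from 6 to 6 × 190 = 1140 rows, close to the 1124 actual ones:

explain (analyze, summary off)
/*+ Rows( countries cities *190 ) */
select * from countries join cities using(country_id)
where state_code='CA' and country_code='US';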

When joining tables, you can use the Rows() hint to adjust the estimated number of rows of the join result: provide the aliases of the joined tables, followed by the correction. You can use # to set an absolute value, but * is usually better because a multiplication factor remains valid when the tables grow. The number doesn't have to be exact. For instance, if you join another table to this example, a misestimate of 6 rows instead of 1124 could lead to a bad decision for a Nested Loop join, which would execute thousands of loops, but an estimation of 1000 or 2000 would not be problematic. In such cases, optimizer hints are the only way to provide good input to the query planner.
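
For reference, here are both forms of the hint on this query (# pins an absolute row count, * applies a multiplication factor):

/*+ Rows( countries cities #1140 ) */ -- absolute: estimate the join result at 1140 rows
/*+ Rows( countries cities *190 ) */  -- relative: multiply the planner's join estimate by 190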

Top comments (2)

Sadeq Dousti

Thanks a lot for the nice article! I always learn a lot from your posts.

Regarding the approach with pg_hint_plan, I investigated a bit whether we can use an alternative way, because hard-coding something like Rows( countries cities *190 ) is not always feasible.

I found out that using a materialized CTE can also help:

with t as materialized
         (select * from countries join cities using (country_id)
          where state_code = 'CA')
select * from t where t.country_code = 'US';

The above query also results in a Hash Join as opposed to a Nested Loop:

CTE Scan on t  (cost=3812.03..3840.72 rows=6 width=220) (actual time=17.872..22.076 rows=1124 loops=1)
  Filter: (country_code = 'US'::text)
  Rows Removed by Filter: 109
  CTE t
    ->  Hash Join  (cost=6.46..3812.03 rows=1275 width=74) (actual time=2.235..21.182 rows=1233 loops=1)
          Hash Cond: (cities.country_id = countries.country_id)
          ->  Seq Scan on cities  (cost=0.00..3802.16 rows=1275 width=62) (actual time=2.161..20.700 rows=1233 loops=1)
                Filter: (state_code = 'CA'::text)
                Rows Removed by Filter: 149340
          ->  Hash  (cost=3.98..3.98 rows=198 width=16) (actual time=0.059..0.060 rows=198 loops=1)
                Buckets: 1024  Batches: 1  Memory Usage: 18kB
                ->  Seq Scan on countries  (cost=0.00..3.98 rows=198 width=16) (actual time=0.011..0.026 rows=198 loops=1)

I'd love to hear your opinion on whether my approach is good or bad. Thank you!

Franck Pachot

When materialization is used, a Nested Loop is prevented because the join condition cannot be pushed down. However, it does not solve the issue of the result being estimated at 6 rows. If you add a join to another table, the bad join method will still be used. In my opinion, it is better to have accurate cardinalities and let the query planner decide, rather than directly forcing a plan.