IN() Index Scan in PostgreSQL 17 and YugabyteDB LSM Tree

TL;DR: PostgreSQL 17 introduces this optimization for B-tree indexes. YugabyteDB's LSM-tree indexes already include an equivalent optimization, which is also used to speed up distributed joins.


When querying with an IN() list, or more generally a Scalar Array Operation Expression (SAOP), PostgreSQL up to version 16 performed one Index Scan for each value in the list.
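Both forms are the same thing internally: the parser rewrites an IN() list of constants into a comparison with an array, which is what the execution plans below display. A minimal illustration, using the 'demo' table created in the next section:

-- Both predicates are parsed to the same Scalar Array Operation Expression;
-- EXPLAIN shows both as: Index Cond: (value = ANY ('{1,2,3}'::integer[]))
select id from demo where value in (1, 2, 3);
select id from demo where value = any ('{1,2,3}'::integer[]);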

Demonstration

For each example, I created a table with an index on 'value' that includes 'id'. I then inserted rows, performed a vacuum, analyzed the table, and checked the number of scans from 'pg_stat_user_tables'.

postgres=# create table demo ( id bigserial primary key, value int );
CREATE TABLE

postgres=# create index on demo ( value asc , id asc);
CREATE INDEX

postgres=# insert into demo ( value ) select generate_series(1,1000000);
INSERT 0 1000000

postgres=# vacuum analyze demo;
VACUUM

postgres=# \! sleep 5

postgres=# select seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
           from pg_stat_user_tables where relid='demo'::regclass
;

 seq_scan | seq_tup_read | idx_scan | idx_tup_fetch
----------+--------------+----------+---------------
        2 |            0 |        0 |             0
(1 row)


I checked the table statistics and found no index scans. The sequential scans are a result of the table and index creation.

PostgreSQL 16

With this table created, I query for a list of 42 values.

postgres=# explain (analyze, buffers) select id
           from demo
           where value in (83, 31, 11, 19, 137, 149, 79, 167, 3, 47, 59, 5, 113, 71, 53, 163, 109, 101, 2, 107, 17, 179, 131, 43, 23, 67, 151, 97, 139, 61, 73, 157, 29, 89, 13, 127, 41, 7, 103, 173, 181, 37)
;
                                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_value_id_idx on demo  (cost=0.42..186.58 rows=42 width=8) (actual time=0.015..0.063 rows=42 loops=1)
   Index Cond: (value = ANY ('{83,31,11,19,137,149,79,167,3,47,59,5,113,71,53,163,109,101,2,107,17,179,131,43,23,67,151,97,139,61,73,157,29,89,13,127,41,7,103,173,181,37}'::integer[]))
   Heap Fetches: 0
   Buffers: shared hit=127
 Planning:
   Buffers: shared hit=22
 Planning Time: 0.218 ms
 Execution Time: 0.075 ms


This query read 127 buffers to retrieve 42 rows, averaging three buffers per row: it clearly descended the three B-Tree levels (one root, one branch, one leaf) for each value. Even when all reads are served from shared memory, this can become a scalability issue, because multiple statements running this pattern, even with different values, must repeatedly pin the same root and branch pages.
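The height of the B-Tree can be verified directly (a quick aside, assuming the pageinspect extension is available): the index metapage reports the level of the root, with leaves at level zero.

-- On this data set the root is most likely at level 2:
-- root, one branch level, and the leaves, i.e. three levels
create extension if not exists pageinspect;
select root, level from bt_metap('demo_value_id_idx');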

Looking at the table statistics, it is evident that this Index Scan was executed 42 times, even though that's not visible in the execution plan (it shows only one loop; there's no visible INLIST ITERATOR like in Oracle).

postgres=# select seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
           from pg_stat_user_tables where relid='demo'::regclass
;

 seq_scan | seq_tup_read | idx_scan | idx_tup_fetch
----------+--------------+----------+---------------
        2 |            0 |       42 |             0

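Note that these counters are cumulative. To restart the count from zero before another test, the per-table counters can be reset (a side note, not needed here since they started at zero):

-- reset the seq_scan / idx_scan counters for the demo table only
select pg_stat_reset_single_table_counters('demo'::regclass);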

This has been improved in PostgreSQL 17.

PostgreSQL 17

I've run the same example with PostgreSQL 17:

postgres=# explain (analyze, buffers) select id
           from demo
           where value in (83, 31, 11, 19, 137, 149, 79, 167, 3, 47, 59, 5, 113, 71, 53, 163, 109, 101, 2, 107, 17, 179, 131, 43, 23, 67, 151, 97, 139, 61, 73, 157, 29, 89, 13, 127, 41, 7, 103, 173, 181, 37)
;

                                                                                       QUERY PLAN                    
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_value_id_idx on demo  (cost=0.42..186.58 rows=42 width=8) (actual time=0.025..0.029 rows=42 loops=1)
   Index Cond: (value = ANY ('{83,31,11,19,137,149,79,167,3,47,59,5,113,71,53,163,109,101,2,107,17,179,131,43,23,67,151,97,139,61,73,157,29,89,13,127,41,7,103,173,181,37}'::integer[]))
   Heap Fetches: 0
   Buffers: shared hit=4
 Planning:
   Buffers: shared hit=20 read=1
 Planning Time: 0.155 ms
 Execution Time: 0.042 ms

postgres=# select seq_scan, seq_tup_read, idx_scan, idx_tup_fetch from pg_stat_user_tables where relid='demo'::regclass;
 seq_scan | seq_tup_read | idx_scan | idx_tup_fetch
----------+--------------+----------+---------------
        2 |            0 |        1 |             0


The table statistics clearly show that only one Index Scan was performed to retrieve the 42 values. The number of buffers read is small, just four blocks. This probably includes one root, one branch, and two leaf blocks.

My query resulted in only a few leaf blocks because I was searching for values smaller than 200 within a range of one million. If I were to query with values scattered across the index range, the number of buffers would be higher:

postgres=# explain (analyze, buffers) select id
 from demo
 where value in (100083, 100031, 100011, 100019, 100137, 100149, 100079, 100167, 1003, 100047, 100059, 1005, 100113, 100071, 100053, 100163, 100109, 100101, 1002, 100107, 100017, 100179, 100131, 10043, 10023, 10067, 100151, 10097, 100139, 10061, 10073, 100157, 10029, 10089, 10013, 100127, 10041, 1007, 100103, 100173, 100181, 10037)
;
                                                                                                                                                           QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_value_id_idx on demo  (cost=0.42..186.58 rows=42 width=8) (actual time=0.026..0.052 rows=42 loops=1)
   Index Cond: (value = ANY ('{100083,100031,100011,100019,100137,100149,100079,100167,1003,100047,100059,1005,100113,100071,100053,100163,100109,100101,1002,100107,100017,100179,100131,10043,10023,10067,100151,10097,100139,10061,10073,100157,10029,10089,10013,100127,10041,1007,100103,100173,100181,10037}'::integer[]))
   Heap Fetches: 0
   Buffers: shared hit=10
 Planning Time: 0.081 ms
 Execution Time: 0.066 ms

In any case, the number of buffers read will always remain bounded by the number of values in the IN() list, and the B-Tree root and branches no longer have to be re-read for each value.
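To probe that bound, here is a sketch with a few values spread across the whole range of one million (the exact buffer count depends on the tree shape, but it should stay close to one descent plus one leaf per value):

-- widely scattered values: roughly one leaf page per value,
-- but still a single descent from the root
explain (analyze, buffers)
select id from demo
where value = any ('{5,250000,500000,750000,999995}'::int[]);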

YugabyteDB

I am using YugabyteDB (version 2024.1.2) and created the same table. There's no need to VACUUM, as the MVCC garbage collection occurs transparently in the distributed storage. YugabyteDB shows no buffers for distributed tables because it bypasses the monolithic shared buffers to scale horizontally. However, the dist option indicates the number of distributed read requests, and the debug option shows additional internal metrics when reading from the LSM Tree on each node.

yugabyte=# create table demo ( id bigserial primary key, value int );
CREATE TABLE
yugabyte=# create index on demo ( value asc , id asc);
CREATE INDEX
yugabyte=# insert into demo ( value ) select generate_series(1,1000000);
INSERT 0 1000000
yugabyte=# analyze demo;
ANALYZE

yugabyte=# explain (analyze, buffers, dist, debug, summary off) select id
 from demo
 where value in (83, 31, 11, 19, 137, 149, 79, 167, 3, 47, 59, 5, 113, 71, 53, 163, 109, 101, 2, 107, 17, 179, 131, 43, 23, 67, 151, 97, 139, 61, 73, 157, 29, 89, 13, 127, 41, 7, 103, 173, 181, 37)
;
                                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_value_id_idx on demo  (cost=180.00..983.22 rows=42 width=8) (actual time=0.735..0.749 rows=42 loops=1)
   Index Cond: (value = ANY ('{83,31,11,19,137,149,79,167,3,47,59,5,113,71,53,163,109,101,2,107,17,179,131,43,23,67,151,97,139,61,73,157,29,89,13,127,41,7,103,173,181,37}'::integer[]))
   Heap Fetches: 0
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 0.605 ms
   Storage Index Rows Scanned: 42
   Metric rocksdb_block_cache_hit: 3.000
   Metric rocksdb_block_cache_index_hit: 1.000
   Metric rocksdb_block_cache_data_hit: 1.000
   Metric rocksdb_block_cache_bytes_read: 32751.000
   Metric rocksdb_number_db_seek: 28.000
   Metric rocksdb_number_db_next: 109.000
   Metric rocksdb_number_db_seek_found: 28.000
   Metric rocksdb_number_db_next_found: 109.000
   Metric rocksdb_iter_bytes_read: 9641.000
   Metric rocksdb_block_cache_single_touch_hit: 1.000
   Metric rocksdb_block_cache_single_touch_bytes_read: 32699.000
   Metric rocksdb_block_cache_multi_touch_hit: 2.000
   Metric rocksdb_block_cache_multi_touch_bytes_read: 52.000
   Metric docdb_keys_found: 42.000
   Metric ql_read_latency: sum: 148.000, count: 1.000
   Estimated Seeks: 43
   Estimated Nexts: 86
   Estimated Docdb Result Width: 9

One "Index Read Request," indicates that there is only one Index Scan to read the 42 values. In the LSM Tree, which is implemented based on RocksDB, seeking to a key is accounted for as "rocksdb_number_db_seek" and reading the following key is accounted for as "rocksdb_number_db_next."

In YugabyteDB, "seek" is equivalent to "buffers" in PostgreSQL. They are the most important metrics for evaluating the cost of a query. They represent a random read to access a set of rows stored together.

If the 42 values were scattered throughout the LSM Tree, we would have expected to see 42 seek operations. However, some of the values are close enough to be read with faster next operations. In this case, we observed 28 instances of rocksdb_number_db_seek and 109 instances of rocksdb_number_db_next.

With more scattered values, it shows slightly more seeks and fewer nexts:

yugabyte=# explain (analyze, buffers, dist, debug, summary off) select id
 from demo
 where value in (100083, 100031, 100011, 100019, 100137, 100149, 100079, 100167, 1003, 100047, 100059, 1005, 100113, 100071, 100053, 100163, 100109, 100101, 1002, 100107, 100017, 100179, 100131, 10043, 10023, 10067, 100151, 10097, 100139, 10061, 10073, 100157, 10029, 10089, 10013, 100127, 10041, 1007, 100103, 100173, 100181, 10037)
;

                                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_value_id_idx on demo  (cost=180.00..983.22 rows=42 width=8) (actual time=0.748..0.761 rows=42 loops=1)
   Index Cond: (value = ANY ('{100083,100031,100011,100019,100137,100149,100079,100167,1003,100047,100059,1005,100113,100071,100053,100163,100109,100101,1002,100107,100017,100179,100131,10043,10023,10067,100151,10097,100139,10061,10073,100157,10029,10089,10013,100127,10041,1007,100103,100173,100181,10037}'::integer[]))
   Heap Fetches: 0
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 0.593 ms
   Storage Index Rows Scanned: 42
   Metric rocksdb_block_cache_hit: 5.000
   Metric rocksdb_block_cache_index_hit: 1.000
   Metric rocksdb_block_cache_data_hit: 3.000
   Metric rocksdb_block_cache_bytes_read: 98236.000
   Metric rocksdb_number_db_seek: 32.000
   Metric rocksdb_number_db_next: 113.000
   Metric rocksdb_number_db_seek_found: 32.000
   Metric rocksdb_number_db_next_found: 113.000
   Metric rocksdb_iter_bytes_read: 10292.000
   Metric rocksdb_block_cache_multi_touch_hit: 5.000
   Metric rocksdb_block_cache_multi_touch_bytes_read: 98236.000
   Metric docdb_keys_found: 42.000
   Metric ql_read_latency: sum: 167.000, count: 1.000
   Estimated Seeks: 43
   Estimated Nexts: 86
   Estimated Docdb Result Width: 9
(22 rows)


YugabyteDB implemented this optimization early because the overhead of performing multiple index scans is high in a distributed database, where each scan can be a network call. It also uses this optimization to minimize the number of iterations in a Nested Loop join.

Batched Nested Loop

In PostgreSQL 17, the scalar array optimization is used with an IN() list but not with a join. For example, I have defined my list of values in a WITH clause:

postgres=# explain (analyze, buffers, summary off)
 with my_values(value) as (
 values (100083), (100031), (100011), (100019), (100137), (100149), (100079), (100167), (1003), (100047), (100059), (1005), (100113), (100071), (100053), (100163), (100109), (100101), (1002), (100107), (100017), (100179), (100131), (10043), (10023), (10067), (100151), (10097), (100139), (10061), (10073), (100157), (10029), (10089), (10013), (100127), (10041), (1007), (100103), (100173), (100181), (10037)
 )
 select id from demo where value in ( select value from my_values );
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.05..188.06 rows=42 width=8) (actual time=0.038..0.114 rows=42 loops=1)
   Buffers: shared hit=127
   ->  HashAggregate  (cost=0.63..1.05 rows=42 width=4) (actual time=0.019..0.024 rows=42 loops=1)
         Group Key: "*VALUES*".column1
         Batches: 1  Memory Usage: 24kB
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.53 rows=42 width=4) (actual time=0.001..0.008 rows=42 loops=1)
   ->  Index Only Scan using demo_value_id_idx on demo  (cost=0.42..4.44 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=42)
         Index Cond: (value = "*VALUES*".column1)
         Heap Fetches: 0
         Buffers: shared hit=127

PostgreSQL executes one nested-loop iteration per value, and the 42 Index Scans are identifiable by 'loops=42'. The number of buffers accessed is the same as with PostgreSQL 16.
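A classic workaround in PostgreSQL, sketched here, is to collapse the subquery into an array with an ARRAY() subquery expression, so that the predicate becomes a scalar array operation again and can be satisfied by a single Index Scan:

-- ARRAY(subquery) is evaluated once as an InitPlan and the predicate
-- becomes "value = ANY ($0)", served by one index scan
explain (analyze, buffers)
with my_values(value) as (
 values (100083), (100031), (100011)  -- list shortened for readability
)
select id from demo
where value = any ( array(select value from my_values) );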

YugabyteDB optimizes this by batching the values from the outer table and pushing down an array for the inner table nested loop join.

yugabyte=# explain (analyze, buffers, dist, debug, summary off)
 with my_values(value) as (
 values (100083), (100031), (100011), (100019), (100137), (100149), (100079), (100167), (1003), (100047), (100059), (1005), (100113), (100071), (100053), (100163), (100109), (100101), (1002), (100107), (100017), (100179), (100131), (10043), (10023), (10067), (100151), (10097), (100139), (10061), (10073), (100157), (10029), (10089), (10013), (100127), (10041), (1007), (100103), (100173), (100181), (10037)
 )
 select id from demo where value in ( select value from my_values );
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 YB Batched Nested Loop Join  (cost=181.47..999.06 rows=42 width=8) (actual time=0.820..0.860 rows=42 loops=1)
   Join Filter: (demo.value = my_values.value)
   CTE my_values
     ->  Values Scan on "*VALUES*"  (cost=0.00..0.53 rows=42 width=4) (actual time=0.001..0.008 rows=42 loops=1)
   ->  HashAggregate  (cost=0.94..1.36 rows=42 width=4) (actual time=0.039..0.043 rows=42 loops=1)
         Group Key: my_values.value
         ->  CTE Scan on my_values  (cost=0.00..0.84 rows=42 width=4) (actual time=0.004..0.023 rows=42 loops=1)
   ->  Index Only Scan using demo_value_id_idx on demo  (cost=180.00..997.14 rows=42 width=12) (actual time=0.701..0.715 rows=42 loops=1)
         Index Cond: (value = ANY (ARRAY[my_values.value, $2, $3, ..., $1024]))
         Heap Fetches: 0
         Storage Index Read Requests: 1
         Storage Index Read Execution Time: 0.568 ms
         Storage Index Rows Scanned: 42
         Metric rocksdb_block_cache_hit: 5.000
         Metric rocksdb_block_cache_index_hit: 1.000
         Metric rocksdb_block_cache_data_hit: 3.000
         Metric rocksdb_block_cache_bytes_read: 98236.000
         Metric rocksdb_number_db_seek: 32.000
         Metric rocksdb_number_db_next: 113.000
         Metric rocksdb_number_db_seek_found: 32.000
         Metric rocksdb_number_db_next_found: 113.000
         Metric rocksdb_iter_bytes_read: 10292.000
         Metric rocksdb_block_cache_multi_touch_hit: 5.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 98236.000
         Metric docdb_keys_found: 42.000
         Metric ql_read_latency: sum: 160.000, count: 1.000
         Estimated Seeks: 43
         Estimated Nexts: 86
         Estimated Docdb Result Width: 14

With YugabyteDB, even if the list of values comes from a table rather than from an array, only one read request is sent to the inner table to retrieve all values. This is visible in the execution plan as a 'YB Batched Nested Loop Join', which batches the requests. I used a Common Table Expression for simplicity, but it works with any table to optimize IN(select) or joins.
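The placeholders up to $1024 in the index condition reflect the batch size. If I'm not mistaken, this is controlled by the yb_bnl_batch_size session parameter (worth verifying against your version's documentation):

-- batch size of outer-row values pushed down to the inner index scan;
-- 1024 matches the $1024 placeholder in the plan above
show yb_bnl_batch_size;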


Further reading: Exploring the New YugabyteDB Cost Based Optimizer (yugabyte.com)