Franck Pachot for YugabyteDB

Posted on Apr 4, 2022 • Edited on Aug 2, 2023

Index Skip Scan in YugabyteDB 🚀🐘

#postgres #yugabyted #distributed #sql

Generally, people recommend putting the most selective column first in an index. Because, as indexes can seek directly into a specific prefix, a selective value here guarantees few rows to read. But there's a more important concern when indexing: you should optimize all the critical use cases with the minimum index maintenance.

For example, imagine that, on a table with columns A and B, you may queries with the following predicates:

where A=$1
where B=$1
where A=$1 and B=$2
where A=$1 and B>$2
where A>$1 and B=$2
where A>$1 and B>$2

It is a bad idea to create two single-column indexes on A and B because because only one will be used, for one index condition, and the other predicate being filtered afterwards. In this db<>fiddle on PostgreSQL, only the Index Cond: is fast access in the index structure. The Filter: line shows predicates that are processed only after retrieving all rows.

You can optimize all of these queries with two indexes, on (A ASC, B ASC) and (B ASC, A ASC). I'm using ASC here which is the default for PostgreSQL but not for YugabyteDB (to distribute by default, the first column is HASH when not mentioned). If you create the database with colocation, ASC is the default and you can stick to a PostgreSQL compatible syntax: PRIMARY KEY(A, B).

The index starting with A ASC can be used for fast access to A=$1 or A>$1 and the one starting with B ASC can be used for fast access to B=$1 or B>$1. The second column can be used for further filtering once the first one already seek to the start if the range to scan for it.

The goal of this post is to show that, in YugabyteDB, one index can serve all queries. This will lower the index maintenance to its minimum, especially when the unique index is the primary key. And having no secondary index provides higher throughput, because reads and writes can use the fast-path (no cross-shard transaction). Then, if only one index key is sufficient, which one to choose?

🐘 PostgreSQL

PostgreSQL has no Skip Scan (aka "loose indexscan" in MySQL or "jump scan" in DB2):

The index on (A ASC, B ASC) can serve all predicates mentioned above, except where B=1 (see this db<>fiddle)
The index on (B ASC, A ASC) can serve all except A=$1 (see this db<>fiddle)

Of course, the efficiency also depends on selectivity. But basically, this means that I need two indexes in PostgreSQL. Note that I've read that the Timescale extension has some Skip Scan, ask @ryanbooz about it if you want to know more. I'll show how (and when) we do that in YugabyteDB, which may be different.

🚀 YugabyteDB

Two storage characteristics of YugabyteDB provide additional optimizations:

The primary key holds all column value without any additional read (because YugabyteDB doesn't store the row in a distinct heap table like PostgreSQL)
And Index Only Scan on secondary key, doesn't have to read the table (because YugabyteDB stores the visibility in the index unlike PostgreSQL storing it only with the table)

This design is critical for a distributed database where single-shard transactions are faster, and where reading the table in addition to the index entry involves cross-shard transactions and cross-node calls.

One additional characteristics of YugabyteDB is about an additional access path within the index:

Reading from an index can seek to multiple places, to skip the first column, and still use an index condition on other columns.

Then, one index only can serve all those predicates when the goal is to reduce the overhead on table's updates.

Let's do it, declaring it as the primary key, with 10 million rows, manually splitting into 3 shards:

create table demo ( A int, B int, primary key(A ASC, B ASC) )
split at values ( (0,0),(33,0),(66,0) ) ;

insert into demo select a,b 
 from generate_series(1,100) a, generate_series(1,100000) b;

explain (costs off, analyze) select * from demo
 where A=50            limit 10;

explain (costs off, analyze) select * from demo
 where B=5000          limit 10;

explain (costs off, analyze) select * from demo
 where A=50 and B=5000 limit 10;

explain (costs off, analyze) select * from demo
 where A=50 and B>5000 limit 10;

explain (costs off, analyze) select * from demo
 where A>50 and B=5000 limit 10;

explain (costs off, analyze) select * from demo
 where A>50 and B>5000 limit 10;

I have put different cardinalities on purpose: low selectivity on A with 100 distinct values, and higher selectivity on B with 100000 values. Because, even if the same index can be used for both, one is less expensive than the other depending on the selectivity. This test will help to decide the order of columns and we will see that the most selective first doesn't apply here.

I'm running this in YugabyteDB 2.13 which implements this technique for range indexes. You can follow #11881 for the implementation in hash sharded tables or indexes.

I'll run all queries, showing the execution plan and some execution statistics (gathered with my experimental ybwr)

where A=50 limit 10;

yugabyte=# explain (costs off, analyze) select * from demo
            where A=50            limit 10;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit (actual time=0.773..0.779 rows=10 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=0.772..0.777 rows=10 loops=1)
         Index Cond: (a = 50)
 Planning Time: 0.056 ms
 Execution Time: 0.806 ms

This is fast, seeking directly to the index entry starting with 50 and reading 10 rows from there.

        row_name         | rocksdb_#_db_seek | rocksdb_#_db_next
-------------------------+-------------------+-------------------
 yugabyte demo 10.0.0.61 |                 1 |                11

The statistics I displayed are the number of "seeks" (getting directly to one value) and "next" reads (each row or column is a value in the YugabyteDB distributed storage).

where B=5000 limit 10;

yugabyte=# explain (costs off, analyze) select * from demo
            where B=5000          limit 10;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit (actual time=1.746..1.752 rows=10 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=1.746..1.750 rows=10 loops=1)
         Index Cond: (b = 5000)
 Planning Time: 0.054 ms
 Execution Time: 1.779 ms

This one is interesting because it exposes the Skip Scan mecanism: the index starts with A where I've no filter but can still use the index structure to find the values of B with very low overhead when compared to the previous case.

        row_name         | rocksdb_#_db_seek | rocksdb_#_db_next
-------------------------+-------------------+-------------------
 yugabyte demo 10.0.0.61 |                22 |                53
 yugabyte demo 10.0.0.62 |                 1 |

The execution statistics explain what happens. The first seek goes to the first value of A, which is A=1,B=1 and then seeks to the value A=1,B=5000 from there. The reads the next value that is A=1,B=5001 which doesn't verify the predicate. It then seeks to the next value of A which is A=2,B=1, then seeks to A=2,B=5000. And so on until it gets the 10 rows required. There are further optimizations to replace short "seeks" with multiple "nexts", and multiple tablets may be involved. But, roughly, it had to seek into 10 different places in the index for A, and then for B, which is about 20 seeks.

You now understand the beauty if this feature, which makes sense when there are not too many values of A. Like when the key is prefixed with a tenant ID, or subsidiary ID, or a type... anything with low cardinality. And you see how the "most selective first" may not be the best recommandation in this case.

where A=50 and B=5000 limit 10;

yugabyte=# explain (costs off, analyze) select * from demo
            where A=50 and B=5000 limit 10;

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Limit (actual time=0.661..0.663 rows=1 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=0.660..0.662 rows=1 loops=1)
         Index Cond: ((a = 50) AND (b = 5000))
 Planning Time: 0.060 ms
 Execution Time: 0.687 ms

This one is similar to the A=50 case which retreived 10 rows with 1 seek and 11 nexts. But here, only one row is fetched:

        row_name         | rocksdb_#_db_seek | rocksdb_#_db_next
-------------------------+-------------------+-------------------
 yugabyte demo 10.0.0.61 |                 1 |                 1

This is where compound indexes are useful. Without it, an index on a lonely A would have to read all entries for A=50, get them back to the query layer, then distribute the reads to the table for those entries and filter on B=5000 afterwards, visible with a rows removed by filter.
You can try this:

yugabyte=# create index demo_a on demo(a asc);
yugabyte=# explain (costs off, analyze)
           /*+ IndexScan(demo demo_a) */
           select * from demo
           where A=50 and B=5000 limit 10;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit (actual time=65.417..1253.115 rows=1 loops=1)
   ->  Index Scan using demo_a on demo (actual time=65.416..1253.113 rows=1 loops=1)
         Index Cond: (a = 50)
         Filter: (b = 5000)
         Rows Removed by Filter: 99999
 Planning Time: 0.102 ms
 Execution Time: 1253.583 ms
yugabyte=# drop index demo_a;

This takes 1 second here but can be longer on a larger table that doesn't fit in cache. It reads 99999+1 index entries, from the range scan, but the worse are the many the hops to the table, for each index entry:

         row_name          | rocksdb_#_db_seek | rocksdb_#_db_next
---------------------------+-------------------+-------------------
 yugabyte demo 10.0.0.62   |            100000 |            100000
 yugabyte demo_a 10.0.0.61 |                98 |            100097
(2 rows)

This happens with secondary indexes. And you may wonder why it had to read from the table to get the value of B that should be there in the index entry (because a secondary index references the primary key). Basically, the primary key is there but encoded. It is probably possible to get the values from it, follow #10169 for this improvement.

This was about secondary indexes. Let's go back to my example where (A,B) is primary key and there's no additional hop as we are already in the table.

where A=50 and B>5000 limit 10;

yugabyte=# explain (costs off, analyze) select * from demo
            where A=50 and B>5000 limit 10;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit (actual time=0.710..1.138 rows=10 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=0.709..1.136 rows=10 loops=1)
         Index Cond: ((a = 50) AND (b > 5000))
 Planning Time: 0.060 ms
 Execution Time: 1.166 ms

Here we have to read more rows, but this is still fast when the result is limited to 10 rows. This is typical from OLTP and critical queries here should have the simplest access path: Index Scan from the primary key, or Index Only Scan from a secondary index.

        row_name         | rocksdb_#_db_seek | rocksdb_#_db_next
-------------------------+-------------------+-------------------
 yugabyte demo 10.0.0.62 |                 2 |                22

More rows are read but immediately filtered from the index because we read one range from A=50, B=5001 to A=50, B=5010.

where A>50 and B=5000 limit 10;

yugabyte=# explain (costs off, analyze) select * from demo
            where A>50 and B=5000 limit 10;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Limit (actual time=0.732..1.269 rows=10 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=0.731..1.267 rows=10 loops=1)
         Index Cond: ((a > 50) AND (b = 5000))
 Planning Time: 0.057 ms
 Execution Time: 1.294 ms

This is another case of Skip Scan even if I have a predicate on the first column A. We seek into the first row to return, A=51,B=5000 and then have to seek to A=52,B=5000 which is at a different place. And so on for the 10 rows to return.

        row_name         | rocksdb_#_db_seek | rocksdb_#_db_next
-------------------------+-------------------+-------------------
 yugabyte demo 10.0.0.61 |                42 |               102

Thanks to Index Skip Scan, the time complexity is limited even if it would be faster with an index on (B ASC, A ASC) for this specific case. This is a balance between a query response time that is good enough, and the overhead to maintain an additional secondary index.

where A>50 and B>5000 limit 10;

yugabyte=# explain (costs off, analyze) select * from demo
            where A>50 and B>5000 limit 10;

                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit (actual time=4292.526..4292.928 rows=10 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=4292.526..4292.926 rows=10 loops=1)
         Index Cond: ((a > 50) AND (b > 5000))
 Planning Time: 0.055 ms
 Execution Time: 4292.978 ms

This is the most difficult predicate because we have two ranges and even if Skip Scan applies there, this takes more than expected to retrieve 10 rows

        row_name         | rocksdb_#_db_seek | rocksdb_#_db_next
-------------------------+-------------------+-------------------
 yugabyte demo 10.0.0.61 |              9503 |            104524

Seeing those numbers, the problem is not only because of the Index Skip Scan seeks. Because I should have my 10 rows from A=51. If you run the same with a higher limit, reading more rows:

yugabyte=# explain (costs off, analyze) select * from demo
            where A>50 and B>5000 limit 1000;

                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Limit (actual time=339.555..343.271 rows=1000 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=339.554..343.199 rows=1000 loops=1)
         Index Cond: ((a > 50) AND (b > 5000))
 Planning Time: 0.064 ms
 Execution Time: 343.345 ms
(5 rows)

This is faster. The reason is, in my opinion, an incorrect pushdown of the limit and the query reads all the 95000 rows for A=50, by pages of 10, rather than stopping at the first page. I have opened issue #11965 for this.

[Update]: this is fixed and is now optimal. Here is the execution plan in 2.17.1 with the dist option to show the number of calls to the storage:

yugabyte=# explain (costs off, analyze, dist) select * from demo
            where A>50 and B>5000 limit 1000;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Limit (actual time=2.551..2.982 rows=1000 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=2.549..2.894 rows=1000 loops=1)
         Index Cond: ((a > 50) AND (b > 5000))
         Storage Index Read Requests: 1
         Storage Index Execution Time: 4.000 ms
 Planning Time: 0.069 ms
 Execution Time: 3.051 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 4.000 ms
 Peak Memory Usage: 8 kB
(11 rows)

Anyway, this is a case where an index starting on B may be more efficient:

yugabyte=# create index demo_ba on demo ( B ASC, A ASC )
            split at values ( (0,0),(33000,0),(66000,0) ) ;

I force it with a hint:

yugabyte=# explain (costs off, analyze) 
           /*+ IndexOnlyScan(demo demo_ba) */ 
           select * from demo
            where A>50 and B>5000 limit 10;

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Limit (actual time=2.735..3.204 rows=10 loops=1)
   ->  Index Only Scan using demo_ba on demo (actual time=2.734..3.202 rows=10 loops=1)
         Index Cond: ((b > 5000) AND (a > 50))
         Heap Fetches: 0
 Planning Time: 0.103 ms
 Execution Time: 3.235 ms

This index is also faster when querying on B:

yugabyte=# explain (costs off, analyze) select * from demo
            where B=5000          limit 10;

                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Limit (actual time=0.764..0.770 rows=10 loops=1)
   ->  Index Only Scan using demo_ba on demo (actual time=0.764..0.768 rows=10 loops=1)
         Index Cond: (b = 5000)
         Heap Fetches: 0
 Planning Time: 0.060 ms
 Execution Time: 0.796 ms

          row_name          | rocksdb_#_db_seek | rocksdb_#_db_next
----------------------------+-------------------+-------------------
 yugabyte demo_ba 10.0.0.61 |                 1 |                11

what is this overhead?

Here is a simple example, inserting 100000 with this additional index, and without:

yugabyte=# \timing on
Timing is on.

yugabyte=# insert into demo select 1000,generate_series(1,100000);
INSERT 0 100000
Time: 14458.442 ms (00:14.458)

yugabyte=# drop index demo_ba;
DROP INDEX
Time: 187.206 ms

yugabyte=# insert into demo select 2000,generate_series(1,100000);
INSERT 0 100000
Time: 9148.974 ms (00:09.149)

The difference is a gain of 30% in inserting new rows here. So, if you need an additional index, it is probably better to keep it and scale out to guarantee the same throughput. But if you don't need it, thanks to Skip Scan, better have the minimum indexes to maintain.

In summary

Of course, having an index (the primary key, or a secondary index) that starts with the column on which you range scan is the most efficient one for your query. However, another index that has the same column at a different position may also be used, with acceptable performance, and reduces the overhead of maintaining indexes during data ingest, deletes, or updates on the indexed column.

The best approach is to list all your access patterns, and define the minimal set of indexes you need for your performance requirements. Checking the execution plan, and performance metrics, is the best way to have predictable performance and scalability. I prefer this approach to "rules of thumbs" like "most selective first". If you have any question, get help from the YugabyteDB community

This was a synthetic test case. But you are using a SQL database and you probably have some many-to-many association tables. They usually have a primary key that includes all columns, and need a secondary index, also with all columns, to navigate in the other way. Now, you know that, maybe, the primary key is sufficient in YugabyteDB.

If you want to play with it, here is the db<>fiddle:
https://dbfiddle.uk/jvVj05Hz

DEV Community