DEV Community

Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Posted on • Edited on

YugabyteDB range sharding: ASC or DESC? (Backward Scan in๐Ÿš€)

With PostgreSQL, using B-Tree indexes, scanning an index forward or backward is usually similar in performance, because the leaf pages are linked to the next and previous ones. YugabyteDB distributed SQL database stores tables and indexes in LSM-Tree structures, where entries are append only, and are navigated differently, with skip lists. They can also be scanned forward or backward but the cost of it is different.

I start a lab:

docker run --name yb --hostname yb --env PGHOSTNAME=yb \
-d --rm --cap-add=SYS_ADMIN -p7000:7000 -p9000:9000 -p5433:5433 \
yugabytedb/yugabyte:2.13.2.0-b135 bin/yugabyted start \
--daemon=false 
sleep 15
docker exec -it yb ysqlsh
Enter fullscreen mode Exit fullscreen mode

I install my ybwr utility to gather some execution statistics and create a demo table:

\! curl -s https://raw.githubusercontent.com/FranckPachot/ybdemo/main/docker/yb-lab/client/ybwr.sql | grep -v '\watch' > ybwr.sql
\i ybwr.sql
create table demo ( id int, primary key(id ASC) );
insert  into demo ( id ) select generate_series (1,1000);
Enter fullscreen mode Exit fullscreen mode

I have defined the primary key as range sharding in ASCending order.

I execute a query reading in ASCending order, like the index, and another one in DESCending order

execute snap_reset;
explain analyze select id from demo order by id ASC;
execute snap_table;
explain analyze select id from demo order by id DESC;
execute snap_table;
Enter fullscreen mode Exit fullscreen mode

The result on this 1000 rows table is:

yugabyte=# execute snap_reset;

 ybwr metrics
--------------
(0 rows)

yugabyte=# explain analyze select id from demo order by id ASC;

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_pkey on demo  (cost=0.00..114.00 rows=1000 width=4) (actual time=4.584..5.417 rows=1000 loops=1)
 Planning Time: 5.227 ms
 Execution Time: 6.013 ms
(3 rows)

yugabyte=# execute snap_table;

                     row_name                      | rocksdb_#_db_seek | rocksdb_#_db_next
---------------------------------------------------+-------------------+-------------------
 yugabyte demo yb b701907f833d402fbe68835cdd7fc2bb |                 1 |               999
(1 row)

yugabyte=# explain analyze select id from demo order by id DESC;

                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan Backward using demo_pkey on demo  (cost=0.00..124.00 rows=1000 width=4) (actual time=1.778..21.633 rows=1000 loops=1)
 Planning Time: 0.102 ms
 Execution Time: 24.215 ms
(3 rows)

yugabyte=# execute snap_table;

                     row_name                      | rocksdb_#_db_seek | rocksdb_#_db_next
---------------------------------------------------+-------------------+-------------------
 yugabyte demo yb b701907f833d402fbe68835cdd7fc2bb |              2062 |              1014
(1 row)
Enter fullscreen mode Exit fullscreen mode

When fetching the rows in the order of the index, with Index Scan, I see one rocksdb_number_db_seek to go to the start of the index, to get the first row, and then 999 rocksdb_number_db_next to get the remaining rows. A next call goes to the next document in the RocksDB database, and is less expensive than a seek call that has to find the right document, from its key value.

When fetching rows in reverse order of the index, with Index Scan Backward, it must seek to the the key, that records column values and MVCC versions, and then go forward to build the row. I can two seek calls per row here, in addition to one next per row.

To get an idea of the CPU cost of it, I ran the backward scan in a loop while taking flamegraph of it:

yum install -y perf git wget
perf record --call-graph fp -F99 -e cpu-cycles -p $(pgrep -d, yb-tserver) sleep 120
git clone https://github.com/brendangregg/FlameGraph.git
wget -c https://raw.githubusercontent.com/FranckPachot/ybio/main/util_flamegraph/palette.map
perf script -i perf.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl --colors green --width=1200 --hash --cp | tee perf.svg && chmod a+r perf.svg
echo $(curl -L --upload-file perf.svg http://transfer.sh/$(date +%Y%m%dT%H%M%S).svg)

Enter fullscreen mode Exit fullscreen mode

You can get look at the .svg and here is a screenshot:
flamegraph
This shows two important call stacks from "To Next Desired Row":

  • yb::docdb::DocRowwiseIterator::HasNext() > yb::docdb::DocRowwiseIterator::AdvanceIteratorToNextDesiredRow > yb::docdb::IntentAwareIterator::PreparePrev > yb::docdb::PerformRocksDBSeek
  • yb::docdb::DocRowwiseIterator::HasNext() > yb::docdb::DocRowwiseIterator::AdvanceIteratorToNextDesiredRow > yb::docdb::IntentAwareIterator::SeekToLatestDocKeyInternal

Note that, if you cannot run perf I show another way to get the call stacks from the tserver endpoint in: https://dev.to/yugabyte/quickly-get-short-stack-from-a-yugabytedb-tserver-2j03

The rows are stored in a document, with one sub-document per column (I simplified here with one column per row) but can accumulate multiple versions, so the timestamp is also part of the key. The structure is optimized for forward scans, with the current version first. To go to the previous SQL key, it needs to seek to the first record of the current key, go to the previous record to know the previous key, and then seek to the first record of it to read the row.

When creating a range sharded primary key or secondary index, it is recommended to think immediately about the natural order of it, for queries that has to retrieve lot of rows in this order, to avoid less efficient backward scans. But what if there are queries ordering in both way? The difference may not justify adding more indexes. Think first about critical queries which read lot of rows in a specific order, and maybe the backward scan will be acceptable for other queries reading less rows. Note that the query planner may also decide to use a forward scan and sort it later. I'll show that in a next post.

YugabyteDB has many optimizations pushed down to the storage layer. If you query only few rows out of many, the Index Scan Backward will minimize the seek operations:

yugabyte=# execute snap_reset;

 ybwr metrics
--------------
(0 rows)

yugabyte=# explain analyze select id from demo where id in (100,200,300) order by id DESC;

                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Scan Backward using demo_pkey on demo  (cost=0.00..4.12 rows=1 width=4) (actual time=1.126..1.136 rows=3 loops=1)
   Index Cond: (id = ANY ('{100,200,300}'::integer[]))
 Planning Time: 0.136 ms
 Execution Time: 1.201 ms
(4 rows)

yugabyte=# execute snap_table;

                     row_name                      | rocksdb_#_db_seek | rocksdb_#_db_next
---------------------------------------------------+-------------------+-------------------
 yugabyte demo yb b701907f833d402fbe68835cdd7fc2bb |                 6 |                 3
(1 row)
Enter fullscreen mode Exit fullscreen mode

There is still two seeks to go backward, but only for the rows that need to be accessed. This means that the importance of ASC or DESC is mostly when fetching lot of rows from the Index Scan.

You can detect this by looking at explain (analyze) output. The indexes where the order may not be the optimal one are those with an Index Scan Backward having a high number of rows in (actual ... rows= or Rows removed by filter)

I've written this post after a little test I did in last week Twitch session - the record is here: https://youtu.be/tiR1_Om_To4?t=3309

YugabyteDB 2.19 is a bit faster

This blog post was with version 2.13 and I've run it in 2.19 where some optimizations were implemented. First, if you want to do the same, you should add --advertise_address=0.0.0.0 because we don't listen on all interfaces by default. My YBWR output for the Index Scan Backward shows now:

yugabyte=# explain analyze select id from demo order by id DESC;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Index Scan Backward using demo_pkey on demo  (cost=0.00..124.00 rows=1000 width=4) (actual time=15.759..16.254 rows=1000 loops=1)
 Planning Time: 4.022 ms
 Execution Time: 16.490 ms
 Peak Memory Usage: 0 kB
(4 rows)

yugabyte=# execute snap_table;
 rocksdb_seek | rocksdb_next | rocksdb_insert |       dbname / relname / tserver / tabletid / leader
--------------+--------------+----------------+-------------------------------------------------------------
         2002 |              |                | yugabyte demo aa7a012a753d4bc19144e190ffc958b8 L 172.17.0.2
(1 row)

yugabyte=# select version();
                                                                                         version
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.2-YB-2.19.0.0-b0 on x86_64-pc-linux-gnu, compiled by clang version 15.0.7 (https://github.com/yugabyte/llvm-project.git 6b9d30d80f5cebc66c9fa46013d18f125ea8ec0e), 64-bit
(1 row)
Enter fullscreen mode Exit fullscreen mode

I still see the same number of seek(), two per index entry, but no additional next() thanks to packed rows. Even if the difference has been reduced, it is till important to define the right ASC / DESC order for the critical use cases that scan many rows.

Top comments (0)