Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Performance of range queries in B-Tree and LSM indexes

Range queries are efficient with B-Tree indexes because they can quickly locate the starting value with a few random reads from the root through the branches to the first leaf. From there, they read many rows from the same block and move efficiently to the next one. B+Tree indexes maintain a doubly linked list between leaves, so going to the next leaf requires only one random read and doesn't have to go through the B-Tree root and branches again.
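To illustrate this leaf chaining, PostgreSQL's pageinspect extension can display a leaf page's sibling pointers. This is only a side check on the demo_value index created below, and it assumes pageinspect is installed and that block 1 is a leaf page:

create extension if not exists pageinspect;
-- block 1 of demo_value is assumed to be a leaf page (type 'l');
-- btpo_prev/btpo_next are the sibling links a range scan follows
-- to reach the next leaf without going back through the root
select blkno, type, btpo_prev, btpo_next, btpo_level
  from bt_page_stats('demo_value', 1);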

With LSM tree (Log-Structured Merge-tree) indexes, the iterator locates the start value with a seek operation and then reads the subsequent rows with the next operation. These reads are mostly contiguous because flush and compaction keep the SST files sorted by key, which gives good read performance as long as the number of SST files stays reasonable.


Let's compare how many bytes are read for a large range query, using a B-Tree on PostgreSQL and an LSM Tree on YugabyteDB, with the following table:

create table demo (id bigserial primary key, value timestamptz);
create index demo_value on demo ( value desc, id );
insert into demo(value) select clock_timestamp() from generate_series(1,10000)
\watch c=100 i=0.01


I'll run a range query that starts at the current time and reads the last fifty thousand rows ordered by their timestamp:

select * from demo 
 where value < now()
 order by value desc limit 50000;

I have created the best index for this query: a descending index on "value" that contains all other columns in the index entry to allow an Index-Only Scan.

To display the performance metrics, I will use explain analyze with specific options for PostgreSQL (such as buffers to count the number of pages read from the shared buffers) and YugabyteDB (such as dist, debug to display the remote read requests and the RocksDB metrics, as YugabyteDB's LSM Tree implementation is based on RocksDB).

Explain Analyze

I use the YugabyteDB docker image to start a lab with YugabyteDB and PostgreSQL:

docker run --rm -it yugabytedb/yugabyte bash

# Start YugabyteDB
yugabyted start --advertise_address=127.0.0.1

# Install and start PostgreSQL
dnf update -y
dnf module enable -y postgresql:16
dnf module install -y postgresql:16
dnf install -y postgresql-contrib
su - postgres -c "pg_ctl init"
su - postgres -c "pg_ctl start"


I can connect to PostgreSQL with psql postgres://postgres@127.0.0.1:5432 and to YugabyteDB with psql postgres://yugabyte@127.0.0.1:5433
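A quick sanity check, not required for the test, is to confirm which server each session is connected to:

select version();
-- PostgreSQL reports 16.x; YugabyteDB reports its PostgreSQL-compatible version (11.2-YB-...)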

In psql, I can run the same script with DDL and DML to create and fill the "demo" table in PostgreSQL and YugabyteDB:

create table demo (id bigserial primary key, value timestamptz);
create index demo_value on demo ( value desc, id );
insert into demo(value) 
 select clock_timestamp() from generate_series(1,10000)
\watch c=100 i=0.01


PostgreSQL B+Tree index

I execute my range query with explain analyze and the buffers option:

postgres=# explain (analyze, buffers, costs off, summary off)
           select * from demo
            where value < now()
            order by value desc limit 50000;

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit (actual time=0.067..18.932 rows=50000 loops=1)
   Buffers: shared hit=616
   ->  Index Only Scan using demo_value on demo (actual time=0.065..14.769 rows=50000 loops=1)
         Index Cond: (value < now())
         Heap Fetches: 50000
         Buffers: shared hit=616
 Planning:
   Buffers: shared hit=20 read=9
(8 rows)


The range query reads 616 buffers of 8KB, so about 4.8 MB was read. Depending on the size of memory allocated and the frequency of reading those blocks, they can be served from the PostgreSQL shared buffers, the Linux kernel buffers, or the disk. I'm looking at the logical reads without considering any cache misses.
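To put those 616 buffers in perspective, we can check the size of the covering index (a quick check that is not part of the original trace):

-- size of the demo_value index and its number of 8KB pages
select pg_size_pretty(pg_relation_size('demo_value')) as index_size,
       pg_relation_size('demo_value') / 8192 as index_pages;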

Running the same query later, or after vacuum demo, reads fewer blocks:

postgres=# vacuum demo;
VACUUM
postgres=# explain (analyze, buffers, costs off, summary off)
           select * from demo
            where value < now()
            order by value desc limit 50000;

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit (actual time=0.036..11.938 rows=50000 loops=1)
   Buffers: shared hit=346
   ->  Index Only Scan using demo_value on demo (actual time=0.035..7.820 rows=50000 loops=1)
         Index Cond: (value < now())
         Heap Fetches: 0
         Buffers: shared hit=346
 Planning:
   Buffers: shared hit=4
(8 rows)

On a freshly vacuumed table with no inserts since the last vacuum run, this query reads 346 B-Tree blocks to get 50000 rows, which is about 2.7 MB.
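The drop from 616 to 346 buffers matches Heap Fetches going from 50000 to 0: once vacuum has set the visibility map, the Index Only Scan no longer has to visit heap pages. This can be verified from pg_class (a quick check):

-- relallvisible close to relpages means the visibility map covers the heap,
-- which is what allows an Index Only Scan with zero heap fetches
select relpages, relallvisible
  from pg_class
 where relname = 'demo';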

YugabyteDB

YugabyteDB behaves like PostgreSQL but doesn't use shared buffers because the query layer is stateless and distributes the reads and writes to multiple nodes. The metric options are dist, to see the remote calls, and debug, which shows the LSM Tree metrics (based on RocksDB):

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
           select * from demo
            where value < now()
            order by value desc limit 50000;

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit (actual time=1.321..35.592 rows=50000 loops=1)
   ->  Index Only Scan using demo_value on demo (actual time=1.319..24.314 rows=50000 loops=1)
         Index Cond: (value < now())
         Heap Fetches: 0
         Storage Index Read Requests: 49
         Storage Index Read Execution Time: 2.449 ms
         Storage Index Rows Scanned: 50176
         Metric rocksdb_block_cache_miss: 2.000
         Metric rocksdb_block_cache_hit: 292.000
         Metric rocksdb_block_cache_add: 2.000
         Metric rocksdb_block_cache_index_hit: 98.000
         Metric rocksdb_block_cache_data_miss: 2.000
         Metric rocksdb_block_cache_data_hit: 96.000
         Metric rocksdb_block_cache_bytes_read: 3203738.000
         Metric rocksdb_block_cache_bytes_write: 65395.000
         Metric rocksdb_number_db_seek: 49.000
         Metric rocksdb_number_db_next: 50225.000
         Metric rocksdb_number_db_seek_found: 49.000
         Metric rocksdb_number_db_next_found: 50225.000
         Metric rocksdb_iter_bytes_read: 3811018.000
         Metric rocksdb_block_cache_single_touch_hit: 96.000
         Metric rocksdb_block_cache_single_touch_add: 2.000
         Metric rocksdb_block_cache_single_touch_bytes_read: 3138960.000
         Metric rocksdb_block_cache_single_touch_bytes_write: 65395.000
         Metric rocksdb_block_cache_multi_touch_hit: 196.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 64778.000
         Metric docdb_keys_found: 50225.000
         Metric rocksdb_read_block_get_micros: sum: 93.000, count: 2.000
         Metric rocksdb_sst_read_micros: sum: 29.000, count: 2.000
         Metric ql_read_latency: sum: 22097.000, count: 49.000
(30 rows)

The RocksDB implementation is more complex than a B+Tree because it is optimized for modern hardware: CPUs with large caches, multi-threading, and SSD storage. Each SST file stores data in 32KB blocks and contains additional per-file metadata for discarding or finding keys: bloom filters and a binary-searchable index over the data blocks. Additional optimizations include data compression, with blocks kept uncompressed in the block cache, and prefix compression, which stores only the part of the key that differs from the previous one.

The important metric here is rocksdb_iter_bytes_read: 3811018, which means that we have read 3.6MB to get the 50000 rows of the result. Those are logical reads from the YugabyteDB block cache (uncompressed), the Linux kernel buffers (compressed), or the storage.

There's no need for a vacuum in YugabyteDB, but I rerun it to get the statistics with a warm cache:

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
select * from demo
 where value < now()
 order by value desc limit 50000;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit (actual time=1.501..32.410 rows=50000 loops=1)
   ->  Index Only Scan using demo_value on demo (actual time=1.498..22.717 rows=50000 loops=1)
         Index Cond: (value < now())
         Heap Fetches: 0
         Storage Index Read Requests: 49
         Storage Index Read Execution Time: 5.533 ms
         Storage Index Rows Scanned: 50176
         Metric rocksdb_block_cache_hit: 294.000
         Metric rocksdb_block_cache_index_hit: 98.000
         Metric rocksdb_block_cache_data_hit: 98.000
         Metric rocksdb_block_cache_bytes_read: 3269133.000
         Metric rocksdb_number_db_seek: 49.000
         Metric rocksdb_number_db_next: 50225.000
         Metric rocksdb_number_db_seek_found: 49.000
         Metric rocksdb_number_db_next_found: 50225.000
         Metric rocksdb_iter_bytes_read: 3811018.000
         Metric rocksdb_block_cache_single_touch_hit: 98.000
         Metric rocksdb_block_cache_single_touch_bytes_read: 3204355.000
         Metric rocksdb_block_cache_multi_touch_hit: 196.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 64778.000
         Metric docdb_keys_found: 50225.000
         Metric ql_read_latency: sum: 22811.000, count: 49.000
(22 rows)

50000 rows were read in 49 read requests, and each read request has read two filter blocks, two index blocks, and two data blocks, for a total of 294 blocks and 3269133 bytes, which is about 3MB. The data block size in YugabyteDB is 32KB.

3.6 MB was read in total (rocksdb_iter_bytes_read: 3811018), and 3.1 MB came from the block cache (rocksdb_block_cache_bytes_read: 3269133) in 98 data block reads (rocksdb_block_cache_data_hit: 98), as the block cache uses 32KB blocks for data (filter and index blocks use a 64KB block size).

Those 49 read requests are due to the default fetch size of 1024 rows per remote call. If I increase the fetch size to get all rows in one call (I do this for the lab observation - it's not a recommendation), I see one read request and six block cache reads:

yugabyte=# set yb_fetch_row_limit=50000;
SET
yugabyte=# explain (analyze, dist, debug, costs off, summary off)
select * from demo
 where value < now()
 order by value desc limit 50000;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit (actual time=25.719..50.862 rows=50000 loops=1)
   ->  Index Only Scan using demo_value on demo (actual time=25.715..41.415 rows=50000 loops=1)
         Index Cond: (value < now())
         Heap Fetches: 0
         Storage Index Read Requests: 1
         Storage Index Read Execution Time: 23.890 ms
         Storage Index Rows Scanned: 50000
         Metric rocksdb_block_cache_hit: 6.000
         Metric rocksdb_block_cache_index_hit: 2.000
         Metric rocksdb_block_cache_data_hit: 2.000
         Metric rocksdb_block_cache_bytes_read: 66717.000
         Metric rocksdb_number_db_seek: 1.000
         Metric rocksdb_number_db_next: 50001.000
         Metric rocksdb_number_db_seek_found: 1.000
         Metric rocksdb_number_db_next_found: 50001.000
         Metric rocksdb_iter_bytes_read: 3790745.000
         Metric rocksdb_block_cache_multi_touch_hit: 6.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 66717.000
         Metric docdb_keys_found: 50001.000
         Metric ql_read_latency: sum: 21672.000, count: 1.000
(20 rows)


The iterator has read a total of 3.6MB (rocksdb_iter_bytes_read: 3790745), like before, but this amount was not read through the block cache, except for two blocks (rocksdb_block_cache_bytes_read: 66717). Consequently, the index read execution time is higher, 23.890 ms compared to 5.533 ms with multiple read requests, because the reads are now served by system calls to the Linux kernel instead of the block cache. I think the reason, in this case, is an optimization to ensure that a large scan does not wipe out the block cache.

Reading 50000 rows out of 1000000 is 5% of the table. Here are the statistics for the secondary index's LSM Tree (visible from the tablet server's port 9000), which in my case, is a single tablet and single SST file (I forced a full compaction with yb-ts-cli --server-address=localhost:9100 compact_all_tablets):
(Screenshot: LSM Tree / SST file statistics from the tablet server web UI)

The uncompressed size (uncompressed_size: 77160443) is 74MB for one million index entries (entries=1000000), which matches the 3.6MB read for the fifty thousand rows. The compressed size is 26MB (total_size: 27497615) and includes the filter blocks (filter blocks=19, filter blocks total size=1244158) and the index blocks (data index blocks=3, data index size=65547). The compression factor is 2.8 here and can be more significant with text data.

The index entries are, on average, 76 bytes for the key and value (raw average key size=61.055827 + raw average value size=14.8208). This means we have about 58MB of keys and 14MB of values, which is typical for an index without INCLUDE columns and for a non-unique index, which includes the primary key pointer in its key to make the LSM Tree key unique.
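A quick arithmetic check of those per-entry averages (run in any SQL session; the numbers are copied from the statistics above):

-- 1,000,000 entries with the average key and value sizes reported above
select round(1000000 * 61.055827 / 1024 / 1024) as key_mib,
       round(1000000 * 14.8208   / 1024 / 1024) as value_mib;
-- about 58 MiB of keys and 14 MiB of values, close to the 74MB uncompressed size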

The execution time for the range query depends heavily on the caches. I see 12 milliseconds for PostgreSQL and 32 milliseconds for YugabyteDB with the default fetch size, of which about five milliseconds are spent on the read requests. To compare the response times under concurrency, I'll use PgBench to run the query from multiple connections.

PgBench

I run the query from ten connections, with 120 executions per connection:

echo 'select * from demo where value < now() order by value desc limit 50000' |
pgbench -c 10 -t120 -nrf /dev/stdin postgres://postgres@127.0.0.1:5432
echo 'select * from demo where value < now() order by value desc limit 50000' |
pgbench -c 10 -t120 -nrf /dev/stdin postgres://yugabyte@127.0.0.1:5433

Here is the result on PostgreSQL:

pgbench (16.1)
transaction type: /dev/stdin
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
number of transactions per client: 120
number of transactions actually processed: 1200/1200
number of failed transactions: 0 (0.000%)
latency average = 81.975 ms
initial connection time = 34.548 ms
tps = 121.987816 (without initial connection time)
statement latencies in milliseconds and failures:
        81.617           0  select * from demo where value < now() order by value desc limit 50000


Here is the result on YugabyteDB:

pgbench (16.1, server 11.2-YB-2.21.1.0-b0)
transaction type: /dev/stdin
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
number of transactions per client: 120
number of transactions actually processed: 1200/1200
number of failed transactions: 0 (0.000%)
latency average = 179.996 ms
initial connection time = 198.517 ms
tps = 55.556916 (without initial connection time)
statement latencies in milliseconds and failures:
       178.840           0  select * from demo where value < now() order by value desc limit 50000


The latency and throughput for this large range query show that PostgreSQL is 2.2 times faster. This is mainly due to YugabyteDB's distributed architecture, and to the fact that PostgreSQL was freshly vacuumed, an ideal case that doesn't happen in production with ongoing writes.


Even if it is a bit slower, the response time in YugabyteDB is more predictable because it doesn't depend on vacuum, and the throughput can increase with horizontal scalability. For the index range scan itself, we have seen similar numbers between B+Tree and LSM Tree, even with very different implementations. Both store index entries sorted in blocks (8KB for PostgreSQL, 32KB for YugabyteDB), and the volume read is similar whether the scan follows the B+Tree links between leaves or reads sequentially within the LSM Tree SST file. Both use a small structure to find the blocks for the range to scan: the root and branches for the B-Tree, the per-file filter and index blocks for the LSM Tree.

Another big difference is that while PostgreSQL uses the same shared buffers for reads and writes, YugabyteDB uses different memory structures. The block cache we have seen here is for reads only from the immutable SST files. The writes go through the MemTable and are flushed regularly to a new SST file. This has the advantage that a read doesn't have to wait for a free buffer, which may depend on the write and checkpoint activity in shared buffer pool databases.
