Franck Pachot for YugabyteDB

Partitions, Merge Append, Pagination, and Limit pushdown in YugabyteDB

When running Top-N queries for pagination, you usually include an ORDER BY clause and a LIMIT or FETCH FIRST ... ROWS ONLY clause. The goal is to avoid scanning a large number of rows, sorting them, and then returning only the first few. It's crucial to have an index that returns the rows in the desired order. With PostgreSQL declarative partitioning, the indexes are local to each partition, but the global order is preserved by a "Merge Append" node, which allows the LIMIT to be optimized.

Here is an example:

create table demo ( id bigint, x int, y float ) partition by range(id);
create table demo0 partition of demo for values from (00) to (10);
create table demo1 partition of demo for values from (10) to (20);
create table demo2 partition of demo for values from (20) to (30);
create table demo3 partition of demo for values from (30) to (40);
-- each execution inserts ids 0 to 39 with random values;
-- \watch repeats it 1000 times every 0.01 seconds (~40,000 rows in total)
insert into demo select generate_series(0,39),100*random(),random();
\watch c=1000 i=0.01
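Before querying, I can check how the rows are distributed across the four partitions, using the standard PostgreSQL tableoid system column:

select tableoid::regclass as partition, count(*)
from demo group by 1 order by 1
;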

I'll run the following query:

select * from demo where x=1
order by x,y desc limit 10
;
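This is equivalent to the SQL-standard FETCH FIRST syntax mentioned in the introduction:

select * from demo where x=1
order by x,y desc
fetch first 10 rows only
;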

If I create an index only for the WHERE clause:

yugabyte=# create index demo_bad on demo(x);
CREATE INDEX

My query reads about a hundred rows from each partition and has to sort them all to get the Top-10 (Sort Method: top-N heapsort):

yugabyte=# explain (analyze, dist)
select * from demo where x=1
order by x,y desc limit 10
;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=21.96..21.99 rows=10 width=20) (actual time=9.165..9.171 rows=10 loops=1)
   ->  Sort  (cost=21.96..22.06 rows=40 width=20) (actual time=9.163..9.164 rows=10 loops=1)
         Sort Key: demo0.y DESC
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Append  (cost=0.00..21.10 rows=40 width=20) (actual time=2.748..9.032 rows=393 loops=1)
               ->  Index Scan using demo0_x_idx on demo0  (cost=0.00..5.22 rows=10 width=20) (actual time=2.745..2.779 rows=101 loops=1)
                     Index Cond: (x = 1)
                     Storage Table Read Requests: 1
                     Storage Table Read Execution Time: 1.458 ms
                     Storage Table Rows Scanned: 101
                     Storage Index Read Requests: 1
                     Storage Index Read Execution Time: 1.072 ms
                     Storage Index Rows Scanned: 101
               ->  Index Scan using demo1_x_idx on demo1  (cost=0.00..5.22 rows=10 width=20) (actual time=2.236..2.266 rows=94 loops=1)
                     Index Cond: (x = 1)
                     Storage Table Read Requests: 1
                     Storage Table Read Execution Time: 1.224 ms
                     Storage Table Rows Scanned: 94
                     Storage Index Read Requests: 1
                     Storage Index Read Execution Time: 0.820 ms
                     Storage Index Rows Scanned: 94
               ->  Index Scan using demo2_x_idx on demo2  (cost=0.00..5.22 rows=10 width=20) (actual time=1.806..1.835 rows=93 loops=1)
                     Index Cond: (x = 1)
                     Storage Table Read Requests: 1
                     Storage Table Read Execution Time: 1.118 ms
                     Storage Table Rows Scanned: 93
                     Storage Index Read Requests: 1
                     Storage Index Read Execution Time: 0.528 ms
                     Storage Index Rows Scanned: 93
               ->  Index Scan using demo3_x_idx on demo3  (cost=0.00..5.22 rows=10 width=20) (actual time=1.997..2.033 rows=105 loops=1)
                     Index Cond: (x = 1)
                     Storage Table Read Requests: 1
                     Storage Table Read Execution Time: 1.168 ms
                     Storage Table Rows Scanned: 105
                     Storage Index Read Requests: 1
                     Storage Index Read Execution Time: 0.665 ms
                     Storage Index Rows Scanned: 105
 Planning Time: 0.175 ms
 Execution Time: 9.251 ms

This is fast in my case, but it doesn't scale with the number of rows per partition.
What I want is for the LIMIT to be pushed down, reading at most ten rows from each partition.

Let's create an index that returns the rows in the right order:

yugabyte=# drop index demo_bad;
DROP INDEX
yugabyte=# create index demo_good on demo(x,y desc);
CREATE INDEX
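Note that creating the index on the partitioned table cascades to each partition. The local indexes, which show up in the execution plan below, can be listed from the pg_indexes view:

select indexname from pg_indexes
where tablename in ('demo0','demo1','demo2','demo3')
;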

Let's run the query again:

yugabyte=# explain (analyze, dist)
select * from demo where x=1
order by x,y desc limit 10
;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.04..1.81 rows=10 width=20) (actual time=7.901..7.919 rows=10 loops=1)
   ->  Merge Append  (cost=0.04..71.04 rows=400 width=20) (actual time=7.899..7.912 rows=10 loops=1)
         Sort Key: demo0.y DESC
         ->  Index Scan using demo0_x_y_idx on demo0  (cost=0.00..16.25 rows=100 width=20) (actual time=2.823..2.826 rows=3 loops=1)
               Index Cond: (x = 1)
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 1.122 ms
               Storage Table Rows Scanned: 10
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 1.154 ms
               Storage Index Rows Scanned: 10
         ->  Index Scan using demo1_x_y_idx on demo1  (cost=0.00..16.25 rows=100 width=20) (actual time=1.741..1.744 rows=3 loops=1)
               Index Cond: (x = 1)
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 0.880 ms
               Storage Table Rows Scanned: 10
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 0.701 ms
               Storage Index Rows Scanned: 10
         ->  Index Scan using demo2_x_y_idx on demo2  (cost=0.00..16.25 rows=100 width=20) (actual time=1.818..1.819 rows=2 loops=1)
               Index Cond: (x = 1)
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 1.087 ms
               Storage Table Rows Scanned: 10
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 0.593 ms
               Storage Index Rows Scanned: 10
         ->  Index Scan using demo3_x_y_idx on demo3  (cost=0.00..16.25 rows=100 width=20) (actual time=1.507..1.509 rows=5 loops=1)
               Index Cond: (x = 1)
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 0.957 ms
               Storage Table Rows Scanned: 10
               Storage Index Read Requests: 1
               Storage Index Read Execution Time: 0.387 ms
               Storage Index Rows Scanned: 10
 Planning Time: 1.350 ms
 Execution Time: 8.000 ms
 Storage Read Requests: 8
 Storage Read Execution Time: 6.882 ms
 Storage Rows Scanned: 80

There's no Sort operation anymore, but a Merge Append that reads from each partition branch and preserves the overall order. This allows YugabyteDB to push the LIMIT down and read only ten rows from each branch.

This reads 40 rows in total: 10 rows from each of the 4 partitions (the plan shows 80 rows scanned because each row is counted twice, as 40 index entries plus 40 table rows). Ten rows, rather than just one, are fetched from each partition because we don't know in advance which partitions the Top-10 will come from, and it is more efficient to read them in batches.
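As a side note, if reading the table rows on top of the index entries were a concern, a covering index could serve the query by itself. This is only a sketch, with a hypothetical demo_covering index, assuming the planner then chooses an Index Only Scan that skips the table reads:

-- hypothetical covering index: "include (id)" covers all columns of "select *"
create index demo_covering on demo(x, y desc) include (id);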

To understand this better, I can set the fetch size to its minimum:

yugabyte=# set yb_fetch_row_limit to 1;
SET

yugabyte=# explain (analyze, dist)
select * from demo where x=1
order by x,y desc limit 10
;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.04..1.81 rows=10 width=20) (actual time=14.748..20.888 rows=10 loops=1)
   ->  Merge Append  (cost=0.04..71.04 rows=400 width=20) (actual time=14.746..20.877 rows=10 loops=1)
         Sort Key: demo0.y DESC
         ->  Index Scan using demo0_x_y_idx on demo0  (cost=0.00..16.25 rows=100 width=20) (actual time=10.187..11.525 rows=3 loops=1)
               Index Cond: (x = 1)
               Storage Table Read Requests: 3
               Storage Table Read Execution Time: 2.282 ms
               Storage Table Rows Scanned: 3
               Storage Index Read Requests: 3
               Storage Index Read Execution Time: 3.930 ms
               Storage Index Rows Scanned: 3
         ->  Index Scan using demo1_x_y_idx on demo1  (cost=0.00..16.25 rows=100 width=20) (actual time=1.674..2.752 rows=3 loops=1)
               Index Cond: (x = 1)
               Storage Table Read Requests: 3
               Storage Table Read Execution Time: 1.697 ms
               Storage Table Rows Scanned: 3
               Storage Index Read Requests: 3
               Storage Index Read Execution Time: 0.731 ms
               Storage Index Rows Scanned: 3
         ->  Index Scan using demo2_x_y_idx on demo2  (cost=0.00..16.25 rows=100 width=20) (actual time=1.580..2.352 rows=2 loops=1)
               Index Cond: (x = 1)
               Storage Table Read Requests: 2
               Storage Table Read Execution Time: 1.434 ms
               Storage Table Rows Scanned: 2
               Storage Index Read Requests: 2
               Storage Index Read Execution Time: 0.687 ms
               Storage Index Rows Scanned: 2
         ->  Index Scan using demo3_x_y_idx on demo3  (cost=0.00..16.25 rows=100 width=20) (actual time=1.294..4.218 rows=5 loops=1)
               Index Cond: (x = 1)
               Storage Table Read Requests: 5
               Storage Table Read Execution Time: 3.344 ms
               Storage Table Rows Scanned: 5
               Storage Index Read Requests: 5
               Storage Index Read Execution Time: 0.375 ms
               Storage Index Rows Scanned: 5
 Planning Time: 6.991 ms
 Execution Time: 22.635 ms
 Storage Read Requests: 26
 Storage Read Execution Time: 14.479 ms
 Storage Rows Scanned: 26

In this run, between 2 and 5 rows were retrieved from each partition to get a total of 10 rows. Although fewer rows were read (26 instead of 80), this required more read requests (26 instead of 8), making the execution slower. It's better to keep the default fetch size, which is automatically adjusted to the LIMIT value.
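To fetch the next page, a keyset predicate on the last value read keeps the same Merge Append and LIMIT pushdown benefits. A minimal sketch, where :last_y is a hypothetical psql variable holding the lowest y from the previous page (and assuming no duplicate y values):

reset yb_fetch_row_limit;
-- hypothetical next-page query, seeking below the previous page's last y
select * from demo where x=1 and y < :last_y
order by x,y desc limit 10
;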
