In a previous post discussing joins and pagination I was reading the first 10000 rows on a large table with some filtering. In YugabyteDB this was executed in 20 read requests, visible as loops=1 ... Read Requests: 10
for the outer table "one"
and loops=10 Read Requests: 1
for the inner table "many"
:
explain (costs off, analyze, dist)
/*+ Set( yb_bnl_batch_size 1024 ) */
select one.category, one.created_at, many.value
from (
select * from one
where one.category='π'
order by one.created_at desc
fetch first 10000 rows only
) one
left outer join many using(one_id)
order by one.created_at desc
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort (actual time=55.449..56.315 rows=10000 loops=1)
Sort Key: one.created_at DESC
Sort Method: quicksort Memory: 853kB
-> YB Batched Nested Loop Left Join (actual time=6.767..51.799 rows=10000 loops=1)
Join Filter: (one.one_id = many.one_id)
-> Limit (actual time=1.567..8.961 rows=10000 loops=1)
-> Index Only Scan using one_category_hash_created_desc_id on one (actual time=1.565..6.822 rows=10000 loops=1)
Index Cond: (category = 'π'::text)
Heap Fetches: 0
Storage Index Read Requests: 10
Storage Index Read Execution Time: 1.420 ms
-> Index Only Scan using many_one_asc on many (actual time=3.179..3.179 rows=0 loops=10)
Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 2.251 ms
I want now to read the next page, with the same performance.
Bad solution: OFFSET
To read the next page I may be tempted to simply add offset 10000
:
explain (costs off, analyze, dist)
/*+ Set( yb_bnl_batch_size 1024 ) */
select one.category, one.created_at, many.value
from (
select * from one
where one.category='π'
order by one.created_at desc
offset 10000 fetch first 10000 rows only
) one
left outer join many using(one_id)
order by one.created_at desc
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Sort (actual time=64.970..65.990 rows=10000 loops=1)
Sort Key: one.created_at DESC
Sort Method: quicksort Memory: 853kB
-> YB Batched Nested Loop Left Join (actual time=18.728..61.328 rows=10000 loops=1)
Join Filter: (one.one_id = many.one_id)
-> Limit (actual time=13.184..21.144 rows=10000 loops=1)
-> Index Only Scan using one_category_hash_created_desc_id on one (actual time=1.541..17.279 rows=20000 loops=1)
Index Cond: (category = 'π'::text)
Heap Fetches: 0
Storage Index Read Requests: 20
Storage Index Read Execution Time: 7.081 ms
-> Index Only Scan using many_one_asc on many (actual time=2.916..2.916 rows=0 loops=10)
Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 2.022 ms
This is not efficient. OFFSET is never a good idea in SQL because it has to read all rows (Index Only Scan ... rows=20000
) and discard the rows below the offset to reduce the result (Limit ... rows=20000
).
Good solution: start with unique last value
The correct way is to keep track of the last row that was read and start from it. Here, because I may have multiple rows for the same created_at
and the limit may have stopped in the middle, I need an additional unique column.
One solution is to use the unique key. I already have one_id
in the index:
yugabyte=# \d one
Table "public.one"
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+-------------------
one_id | uuid | | not null | gen_random_uuid()
category | text | | |
created_at | timestamp with time zone | | | clock_timestamp()
xxx | integer | | |
Indexes:
"one_pkey" PRIMARY KEY, lsm (one_id HASH)
"one_category_hash_created_desc_id" lsm (category HASH, created_at DESC, one_id ASC)
Referenced by:
TABLE "many" CONSTRAINT "many_one_id_fkey" FOREIGN KEY (one_id) REFERENCES one(one_id)
Then I change the ORDER BY accordingly (order by one.created_at desc, one.one_id asc
) and return one_id
, in addition to created_at
, to the application. For the pagination, I add the where clause that matches the ORDER BY caluse ( one.created_at < $1::timestamptz or one_id > $2::uuid )
so that the application run it with the last (created_at
,one_id
) as ($1,$2). Unfortunately, this OR condition cannot use the index and then I add a redundant predicate: one.created_at <= $1::timestamptz
. If I want to use the same query for the first and next pages, an additional clause will check if $1
and $2
are null:
yugabyte=# '2023-12-07 20:56:15.3813+00' '6b7dcda2-a1de-4578-8252-fe86b44f6cbc'
yugabyte=# explain (costs off, analyze, dist)
/*+ Set( yb_bnl_batch_size 1024 ) */
select one.category, one.created_at, many.value, one.one_id
from (
select * from one
where one.category='π'
and (
-- first page starts at the begining of the index
( $1::timestamptz is null and $2::uuid is null )
or
(
-- next page start for the index scan
one.created_at <= $1::timestamptz
-- precise row to start
and (
one.created_at < $1::timestamptz
or (one.created_at = $1::timestamptz and one_id > $2::uuid )
)
)
)
order by one.created_at desc, one.one_id asc
fetch first 10000 rows only
) one
left outer join many using(one_id)
order by one.created_at desc
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Sort (actual time=53.851..55.106 rows=10000 loops=1)
Sort Key: one.created_at DESC
Sort Method: quicksort Memory: 1166kB
-> YB Batched Nested Loop Left Join (actual time=6.434..49.937 rows=10000 loops=1)
Join Filter: (one.one_id = many.one_id)
-> Limit (actual time=1.897..9.530 rows=10000 loops=1)
-> Index Only Scan using one_category_hash_created_desc_id on one (actual time=1.895..7.235 rows=10000 loops=1)
Index Cond: ((category = 'π'::text) AND (created_at <= '2023-12-07 20:56:15.20535+00'::timestamp with time zone))
Heap Fetches: 0
Storage Index Read Requests: 10
Storage Index Read Execution Time: 1.739 ms
-> Index Only Scan using many_one_asc on many (actual time=2.944..2.944 rows=0 loops=10)
Index Cond: (one_id = ANY (ARRAY[one.one_id, $1, $2, ..., $1023]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 2.104 ms
Planning Time: 0.807 ms
Execution Time: 57.239 ms
This is more complex to write than OFFSET but also gives more control and is optimized to read only what is necessary.
I've already written about this in the past:
Efficient pagination in YugabyteDB & PostgreSQL πππ
Franck Pachot for YugabyteDB Distributed PostgreSQL Database γ» Mar 18 '22
Alternative solution: last value + relative offset
Without adding a unique column from the table, you can create one using a window function that has the same ordering as the ORDER BY FETCH FIRST ROWS. Here's an example: row_number() over(partition by created_at order by created_at desc)
. This way, there no additional sort needed (check it in the execution plan) and you have numbered each duplicate row. However, in my opinion, all tables should have a unique key, and adding it to the end of the index shouldn't be an issue.
Top comments (0)