DEV Community

The SQL I Love. Efficient pagination of a table with 100M records

Viach Kakovskyi on September 24, 2017

Originally published on my blog: All You Need Is Backend. I am a huge fan of databases. I even wanted to make my own DBMS when I was in universi...
Argyrios Samourkasidis

Thanks Viach for the great report!
The performance difference among the three solutions is dramatic.

Regarding the keyset pagination:
It assumes that we will iterate through the resulting pages sequentially.
I mean, you can't retrieve the 50th page without retrieving the 49th, and so on.
Is that right?

In many cases, though, this is a negligible trade-off.

Viach Kakovskyi

Hey, Argyrios

Thank you for the kind words!

I did not have that task [jumping to a random page] since I needed to scan the whole table. But I think the solution would work in the following way:

  1. You define the range of page numbers that should be rendered, like 10-20.
  2. You define the page size. For example, 20.
  3. You scan from the 1st page to the 10th to find the user_id that starts the 10th page - the first one you're interested in. You use your page size (20) to do that (in the blog post, a page size of 10 000 is used).
  4. For each page from the 11th to the 19th, you record the user_id that starts the page.

As you can see, the approach should work in production, but providing a link to a page with a random number (say, 1234th) requires scanning from the very first page every time. If the dataset is immutable, we can try to use caching.
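A rough sketch of that flow in Python, assuming the users/user_id schema from the post and a pymysql connection (all connection details are placeholders, and fetch_page is a made-up name for illustration):

```python
# A sketch of "jump to page N" with keyset pagination: we walk the page
# boundaries with WHERE user_id > ? instead of using OFFSET.
import pymysql

PAGE_SIZE = 20

def fetch_page(conn, page_number):
    """Return the rows of `page_number` (1-based) by skipping keysets, not rows."""
    last_user_id = 0
    with conn.cursor() as cur:
        # Skip pages 1 .. page_number-1; we only need the last user_id of each.
        for _ in range(page_number - 1):
            cur.execute(
                "SELECT user_id FROM users WHERE user_id > %s "
                "ORDER BY user_id ASC LIMIT %s",
                (last_user_id, PAGE_SIZE),
            )
            ids = [row[0] for row in cur.fetchall()]
            if not ids:
                return []                      # fewer pages than requested
            last_user_id = ids[-1]
        # Fetch the full rows of the page we actually want.
        cur.execute(
            "SELECT * FROM users WHERE user_id > %s "
            "ORDER BY user_id ASC LIMIT %s",
            (last_user_id, PAGE_SIZE),
        )
        return cur.fetchall()

connection = pymysql.connect(host="localhost", user="app", password="secret",
                             database="app_db")
rows = fetch_page(connection, 1234)            # still scans 1233 boundaries first
```

The skip loop is exactly the "scan from the very first page" cost described above, which is why caching the boundaries becomes attractive.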

If you're interested, I can test this solution (or any other suggested one) on my dataset for the next blog post in the series about SQL <3. It might not be too bad for the first few hundred pages, depending on the page size.

Argyrios Samourkasidis

Viach,
Thanks for the prompt reply!

I like the caching approach, in any case.
It would further improve performance, since the user reveals their intentions when they submit the first query. Then, the backend partitions the result into pages. (we actually considered this caching technique here!).

I hadn't put much thought into dataset immutability, though. Indeed, it seems to be an important factor.

What do you mean by caching, though? Caching the user_id boundaries (i.e., the first one of every page), or caching all the pages?

Viach Kakovskyi

I can think of the following approach:

We go through the dataset from the very first record and split it into pages of a predefined size. The goal of the process is to have the user_id boundaries you mentioned before. An example for a page size of 10 000:

  • page 1, user_ids: 1 - 13 122
  • page 2, user_ids: 13 125 - 23 421
  • page 3, user_ids: 23 423 - 35 008
  • page 4, ...

You may notice that the difference between the boundary user_ids is more than 10 000 - this is because we can have gaps (deleted users).

When you need to render all users for page #3, you look into the cache and use the user_id boundaries for that purpose.

But when a user on page #3 is deleted, you need to recalculate the cache for all pages after it. There is no reason to do that for pages #1 and #2, since the boundaries there are unchanged.
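A minimal sketch of that cache, assuming an in-memory dict and the users/user_id schema from the post (rebuild_from and render_page are made-up names, and conn stands for a DB-API connection such as pymysql):

```python
# page_boundaries maps page number -> user_id that starts the page,
# e.g. {1: 1, 2: 13125, 3: 23423, ...}.
PAGE_SIZE = 10_000
page_boundaries = {}

def rebuild_from(conn, page=1):
    """(Re)compute boundaries from `page` onward, e.g. after a delete on that page."""
    last_id = page_boundaries[page] - 1 if page > 1 else 0
    with conn.cursor() as cur:
        while True:
            cur.execute(
                "SELECT user_id FROM users WHERE user_id > %s "
                "ORDER BY user_id ASC LIMIT %s",
                (last_id, PAGE_SIZE),
            )
            ids = [row[0] for row in cur.fetchall()]
            if not ids:
                # Drop boundaries of pages that no longer exist.
                for stale in [p for p in page_boundaries if p >= page]:
                    del page_boundaries[stale]
                return
            page_boundaries[page] = ids[0]     # first user_id of this page
            last_id, page = ids[-1], page + 1

def render_page(conn, page):
    """Serve page `page` straight from the cached boundary, no OFFSET."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM users WHERE user_id >= %s "
            "ORDER BY user_id ASC LIMIT %s",
            (page_boundaries[page], PAGE_SIZE),
        )
        return cur.fetchall()
```

Here rebuild_from(conn, 3) would be the recalculation step for the deleted-user case; pages #1 and #2 stay untouched.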

Sorry for the delay with the response this time.

Lluís Josep Martínez

What's your method of scanning large tables?

We have no use cases where scanning large tables is needed, at least in the UI. We return the first page, then the second page (skipping the first), and so on. End users stop paginating after a few pages => they must refine the query.

For backend processes we just use a ResultSet, but there's no need to paginate, obviously.
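For what it's worth, a similar "just stream the result set" pattern in Python would be an unbuffered server-side cursor; a sketch assuming pymysql and the users table from the post (connection details are placeholders):

```python
# Streaming a large table without pagination: an unbuffered cursor fetches
# rows as we iterate instead of loading the whole result set into memory.
import pymysql
import pymysql.cursors

conn = pymysql.connect(
    host="localhost", user="app", password="secret", database="app_db",
    cursorclass=pymysql.cursors.SSCursor,   # server-side, unbuffered cursor
)
with conn.cursor() as cur:
    cur.execute("SELECT user_id FROM users ORDER BY user_id")
    for (user_id,) in cur:                  # rows arrive incrementally
        pass                                # process each row here
conn.close()
```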

Al Amin

We cannot use traditional page numbers like 1, 2, 3, 8, 9 in keyset pagination. We can only offer 'Next' and 'Prev' buttons. Aside from this constraint, keyset pagination is the only way to make it fast when you have millions of records.
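As a rough sketch (not from the article), assuming the user_id keyset from the post and a page size of 20, the 'Next' and 'Prev' queries could look like this:

```python
# "Next"/"Prev" keyset queries; first_id / last_id are the boundary user_ids
# of the page currently shown on screen.
NEXT_PAGE_SQL = (
    "SELECT * FROM users WHERE user_id > %s "
    "ORDER BY user_id ASC LIMIT 20"
)
# For "Prev" we walk backwards, then restore the ascending order for display.
PREV_PAGE_SQL = (
    "SELECT * FROM ("
    "  SELECT * FROM users WHERE user_id < %s "
    "  ORDER BY user_id DESC LIMIT 20"
    ") AS prev_page ORDER BY user_id ASC"
)

def next_page(cur, last_id):
    cur.execute(NEXT_PAGE_SQL, (last_id,))
    return cur.fetchall()

def prev_page(cur, first_id):
    cur.execute(PREV_PAGE_SQL, (first_id,))
    return cur.fetchall()
```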

Benoit Jacquemont

It's a very similar approach to the "Search After" feature of Elasticsearch (elastic.co/guide/en/elasticsearch/...), which has exactly the same problem with the "classical" offset/limit pagination.
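For comparison, a sketch of that Elasticsearch "search_after" flow via the REST API; the index name "users", the sort field "user_id", and the local URL are assumptions for illustration:

```python
# Keyset-style pagination in Elasticsearch: pass the sort values of the last
# hit of the previous page as "search_after" in the next request.
import requests

ES_URL = "http://localhost:9200/users/_search"

def fetch_page(search_after=None, size=1000):
    body = {
        "size": size,
        "sort": [{"user_id": "asc"}],        # a unique sort key, like the keyset column
    }
    if search_after is not None:
        body["search_after"] = search_after  # sort values of the previous page's last hit
    resp = requests.post(ES_URL, json=body).json()
    return resp["hits"]["hits"]

# Iterate through all documents, page by page, without offsets.
last_sort = None
while True:
    hits = fetch_page(last_sort)
    if not hits:
        break
    # ... process hits ...
    last_sort = hits[-1]["sort"]             # carries the "keyset" to the next request
```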

By the way, we're using this method with MySQL too, for a Web API used to download all the entities of a DB.

Thanks for documenting it!

EvilKittenLord

Depending on whether or not your UX SLAs will permit it, you can pre-generate all your pagination ranges and store them in a cache. If your UX can handle a 10-second delay from post to publish, that gives a 10,000 ms window in which a large portion of a pagination query can be cached. Cache each list display size, e.g. Show: 100, 250, 500 items.
Cool?
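A sketch of that pre-generation, assuming one background pass over the user_id keyset builds the boundaries for every display size at once (DISPLAY_SIZES and pregenerate_boundaries are illustrative names; conn is a DB-API connection):

```python
# Pre-generate "first user_id of each page" for every display size in one scan,
# so page links can later be served straight from the cache.
DISPLAY_SIZES = (100, 250, 500)
BATCH = 10_000

def pregenerate_boundaries(conn):
    boundaries = {size: [] for size in DISPLAY_SIZES}   # size -> [first id per page]
    seen, last_id = 0, 0
    with conn.cursor() as cur:
        while True:
            cur.execute(
                "SELECT user_id FROM users WHERE user_id > %s "
                "ORDER BY user_id ASC LIMIT %s",
                (last_id, BATCH),
            )
            ids = [row[0] for row in cur.fetchall()]
            if not ids:
                return boundaries
            for user_id in ids:
                for size in DISPLAY_SIZES:
                    if seen % size == 0:        # this id starts a new page for `size`
                        boundaries[size].append(user_id)
                seen += 1
            last_id = ids[-1]
```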

Olivier Chédru

Also, mandatory kudos to @MarkusWinand for use-the-index-luke.com/no-offset

Viach Kakovskyi

Definitely, the whole Use The Index, Luke! site is great.

MetaDave 🇪🇺

"... with 100 000 000 records, the query is never finished. The DBMS just kills it. Why? Probably, because it led to the attempt to load the whole table into RAM. Before returning data to the client."

But you don't mean that the database would not return data to the client application before it had finished reading the data, do you? You might like to clarify exactly which client you mean, here.