I did not have the task [jumping onto a random page] since I needed to scan the whole table. But I think that the solution would work in the following way:
You define the range of page numbers that should be rendered, like 10-20.
You define the page size. For example, 20.
You scan from the 1st page to the 10th to find user_id of the 10th page - the first one that is interesting for you. You use your page size (20) to make it (in the blog post page size of 10 000 is used).
For each page from 11th to 19th you're interested in the user_id that starts a page.
As you can see, the approach should work in production, but providing a link to a page with a random number (say, 1234th) requires scanning from the very first page every time. If the dataset is immutable, we can try to use caching.
If you're interested, I can practically test the solution or any other suggested one on my dataset for the next blog post in the series about SQL <3. It can be not very bad for the first hundreds of pages and depends on the size of a page.
I like the caching approach, in any case.
It would further improve performance, since the user reveals their intentions when they submit the first query. Then, the backend partitions the result into pages. (we actually considered this caching technique here!).
I didn't put much thought on dataset immutability, though. Indeed, it seems to be an important factor.
What do you mean by caching though? Caching the user_idboundaries (i.e. the first for every page), or caching all the pages?
We go thru the dataset from the very first record and split it into pages with a predefined size. The goal of the process is to have user_id boundaries as you mentioned before. Example for a page size of 10 000:
page 1, user_ids: 1 - 13 122
page 2, user_ids: 13 125 - 23 421
page 3, user_ids: 23 423 - 35 008
page 4, ...
You may notice that the difference between user_ids in boundaries is more than 10 000 - this is because we can have gaps (deleted users).
When you need to render all users for a page #3, you look into the cache and user the user_id boundaries for the purpose.
But when a user on page #3 is deleted - you need to recalculate the cache for all pages after this one. There is no reason to do that for pages #1 and #2 since boundaries for the users there are unchanged.
Sorry for the delay with the response this time.
For further actions, you may consider blocking this person and/or reporting abuse
We're a place where coders share, stay up-to-date and grow their careers.
Hey, Argyrios
Thank you for the good words!
I did not have the task [jumping onto a random page] since I needed to scan the whole table. But I think that the solution would work in the following way:
user_id
of the 10th page - the first one that is interesting for you. You use your page size (20) to make it (in the blog post page size of 10 000 is used).user_id
that starts a page.As you can see, the approach should work in production, but providing a link to a page with a random number (say, 1234th) requires scanning from the very first page every time. If the dataset is immutable, we can try to use caching.
If you're interested, I can practically test the solution or any other suggested one on my dataset for the next blog post in the series about SQL <3. It can be not very bad for the first hundreds of pages and depends on the size of a page.
Viach,
Thanks for the prompt reply!
I like the caching approach, in any case.
It would further improve performance, since the user reveals their intentions when they submit the first query. Then, the backend partitions the result into pages. (we actually considered this caching technique here!).
I didn't put much thought on dataset immutability, though. Indeed, it seems to be an important factor.
What do you mean by caching though? Caching the
user_id
boundaries (i.e. the first for every page), or caching all the pages?I can think about the following approach:
We go thru the dataset from the very first record and split it into pages with a predefined size. The goal of the process is to have
user_id
boundaries as you mentioned before. Example for a page size of 10 000:You may notice that the difference between
user_ids
in boundaries is more than 10 000 - this is because we can have gaps (deleted users).When you need to render all users for a page #3, you look into the cache and user the
user_id
boundaries for the purpose.But when a user on page #3 is deleted - you need to recalculate the cache for all pages after this one. There is no reason to do that for pages #1 and #2 since boundaries for the users there are unchanged.
Sorry for the delay with the response this time.