Originally published on my blog: All You Need Is Backend.
I am a huge fan of databases. I even wanted to make my own DBMS when I was in university. Now I work both with RDBMS and NoSQL solutions, and I am very enthusiastic with that. You know, there's no Golden Hammer, each problem has own solution. Alternatively, a subset of solutions.
In the series of blog posts The SQL I Love <3 I walk you thru some problems solved with SQL which I found particularly interesting. The solutions are tested using a table with more than 100 million records. All the examples use MySQL, but ideas apply to other relational data stores like PostgreSQL, Oracle and SQL Server.
This Chapter is focused on efficient scanning a large table using pagination with offset
on the primary key. This is also known as keyset pagination.
Background
In the chapter, we use the following database structure for example. The canonical example about users should fit any domain.
CREATE TABLE `users` (
`user_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`external_id` varchar(32) NOT NULL,
`name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`metadata` text COLLATE utf8_unicode_ci,
`date_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`user_id`),
UNIQUE KEY `uf_uniq_external_id` (`external_id`),
UNIQUE KEY `uf_uniq_name` (`name`),
KEY `date_created` (`date_created`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
A few comments about the structure:
-
external_id
column stores reference to the same user in other system in UUID format -
name
representsFirstname Lastname
-
metadata
column contains JSON blob with all kinds of unstructured data
The table is relatively large and contains around 100 000 000 records. Let's start our learning journey.
Scanning a Large Table
Problem: You need to walk thru the table, extract each record, transform it inside your application's code and insert to another place. We focus on the first stage in the post - scanning the table.
Obvious and wrong solution
SELECT user_id, external_id, name, metadata, date_created
FROM users;
In my case with 100 000 000 records, the query is never finished. The DBMS just kills it. Why? Probably, because it led to the attempt to load the whole table into RAM. Before returning data to the client. Another assumption - it took too much time to pre-load the data before sending and the query was timed out.
Anyway, our attempt to get all records in time failed. We need to find some other solution.
Solution #2
We can try to get the data in pages. Since records are not guaranteed to be ordered in a table on physical or logical level - we need to sort them on the DBMS side with ORDER BY
clause.
SELECT user_id, external_id, name, metadata, date_created
FROM users
ORDER BY user_id ASC
LIMIT 0, 10 000;
10 000 rows in set (0.03 sec)
Sweet. It worked. We asked the first page of 10 000 records, and it took only 0.03
sec to return it. However, how it would work for the 5000th page?
SELECT user_id, external_id, name, metadata, date_created
FROM users
ORDER BY user_id ASC
LIMIT 50 000 000, 10 000; --- 5 000th page * 10 000 page size
10 000 rows in set (40.81 sec)
Indeed, this is very slow. Let's see how much time is needed to get the data for the latest page.
SELECT user_id, external_id, name, metadata, date_created
FROM users
ORDER BY user_id ASC
LIMIT 99 990 000, 10 000; --- 9999th page * 10 000 page size
10 000 rows in set (1 min 20.61 sec)
This is insane. However, can be OK for solutions that run in the background. One more hidden problem with the approach can be revealed if you try to delete a record from the table in the middle of scanning it. Say, you finished the 10th page (100 000 records are already visited), going to scan the records between 100 001 and 110 000. But records 99 998 and 99 999 are deleted before the next SELECT
execution. In that case, the following query returns the unexpected result:
SELECT user_id, external_id, name, metadata, date_created
FROM users
ORDER BY user_id ASC
LIMIT 100 000, 10 000;
N, id, ...
1, 100 003, ...
2, 100 004, ...
As you can see, the query skipped the records with ids 100 001 and 100 002. They will not be processed by application's code with the approach because after the two delete operations they appear in the first 100 000 records. Therefore, the method is unreliable if the dataset is mutable.
Solution #3 - the final one for today
The approach is very similar to the previous one because it still uses paging, but now instead of relying on the number of scanned records, we use the user_id
of the latest visited record as the offset
.
Simplified algorithm:
- We get
PAGE_SIZE
number of records from the table. Starting offset value is 0. - Use the max returned value for
user_id
in the batch as the offset for the next page. - Get the next batch from the records which have
user_id
value higher than currentoffset
.
The query in action for 5 000th page, each page contains data about 10 000 users:
SELECT user_id, external_id, name, metadata, date_created
FROM users
WHERE user_id > 51 234 123 --- value of user_id for 50 000 000th record
ORDER BY user_id ASC
LIMIT 10 000;
10 000 rows in set (0.03 sec)
Wow, it is significantly faster than the previous approach. More than in 1000 times.
Note, that the values of user_id
are not sequential and can have gaps like 25 348 is right after 25 345. The solution also works if any records from future pages are deleted - even in that case query does not skip records. Sweet, right?
Explaining performance
For further learning, I recommend investigating results of EXPLAIN EXTENDED
for each version of the query to get the next 10 000 records after 50 000 000.
| Solution | Time | Type | Keys | Rows | Filtered | Extra
------------------------------------------------------------------------------
| 1. Obvious | Never | ALL | NULL | 100M | 100.00 | NULL
| 2. Offset paging | 40.81 sec | index | NULL / PRI | 50M | 200.00 | NULL
| 3. Keyset paging | 0.03 sec | range | PRI / PRI | 50M | 100.00 | Using where
Let's focus on the key difference between execution plans for 2nd and 3rd solutions since the 1st one is not practically useful for large tables.
-
Join type:
index
vsrange
. The first one means that whole index tree is scanned to find the records.range
type tells us that index is used only to find matching rows within a specified range. So,range
type is faster thanindex
. -
Possible keys:
NULL
vsPRIMARY
. The column shows the keys that can be used by MySQL. BTW, looking into keys column, we can see that eventuallyPRIMARY
key is used for the both queries. -
Rows:
50 010 000
vs50 000 000
. The value displays a number of records analyzed before returning the result. For the 2nd query, the value depends on how deep is our scroll. For example, if we try to get the next10 000
records after 9999th page then99 990 000
records are examined. In opposite, the 3rd query has a constant value; it does not matter if we load data for the 1st page of the very last one. It is always half size of the table. -
Filtered:
200.00
vs100.00
. The column indicates estimated the percentage of the table to be filtered before processing. Having the higher value is better. The value of100.00
means that the query looks thru the whole table. For the 2nd query, the value is not constant and depends on the page number: if we ask 1st page the value of filtered column would be1000000.00
. For the very last page, it would be100.00
. -
Extra:
NULL
vsUsing where
. Provides additional information about how MySQL resolves the query. Usage ofWHERE
onPRIMARY
key make the query execution faster.
I suspect that join type is the parameter of the query that made the largest contribution to performance to make the 3rd query faster. Another important thing is that the 2nd query is extremely dependent on the number of the page to scroll. More deep pagination is slower in that case.
More guidance about understaing output for EXPLAIN
command can be found in the official documentation for your RDBMS.
Summary
The main topic for the blog post was related to scanning a large table with 100 000 000 records using offset
with a primary key (keyset pagination). Overall, 3 different approaches were reviewed and tested on the corresponding dataset. I recommend only one of them if you need to scan a mutable large table.
Also, we revised usage of EXPLAIN EXTENDED
command to analyze execution plan of MySQL queries. I am sure that other RDBMS have analogs for the functionality.
In the next chapter, we will pay attention to data aggregation and storage optimization. Stay tuned!
What's your method of scanning large tables?
Do you remember any other purpose of using offset on the primary key?
Top comments (10)
Thanks Viach for the great report!
Performance difference among the three solutions is chaotic.
Regarding the keyset pagination:
It assumes that one we will iterate through the resulting pages sequentially.
I mean you can't retrieve the 50th page, without retrieving the 49th, etc.
Is that right?
In many cases though this is a negligible trade-off.
Hey, Argyrios
Thank you for the good words!
I did not have the task [jumping onto a random page] since I needed to scan the whole table. But I think that the solution would work in the following way:
user_id
of the 10th page - the first one that is interesting for you. You use your page size (20) to make it (in the blog post page size of 10 000 is used).user_id
that starts a page.As you can see, the approach should work in production, but providing a link to a page with a random number (say, 1234th) requires scanning from the very first page every time. If the dataset is immutable, we can try to use caching.
If you're interested, I can practically test the solution or any other suggested one on my dataset for the next blog post in the series about SQL <3. It can be not very bad for the first hundreds of pages and depends on the size of a page.
Viach,
Thanks for the prompt reply!
I like the caching approach, in any case.
It would further improve performance, since the user reveals their intentions when they submit the first query. Then, the backend partitions the result into pages. (we actually considered this caching technique here!).
I didn't put much thought on dataset immutability, though. Indeed, it seems to be an important factor.
What do you mean by caching though? Caching the
user_id
boundaries (i.e. the first for every page), or caching all the pages?I can think about the following approach:
We go thru the dataset from the very first record and split it into pages with a predefined size. The goal of the process is to have
user_id
boundaries as you mentioned before. Example for a page size of 10 000:You may notice that the difference between
user_ids
in boundaries is more than 10 000 - this is because we can have gaps (deleted users).When you need to render all users for a page #3, you look into the cache and user the
user_id
boundaries for the purpose.But when a user on page #3 is deleted - you need to recalculate the cache for all pages after this one. There is no reason to do that for pages #1 and #2 since boundaries for the users there are unchanged.
Sorry for the delay with the response this time.
We have no use cases where scanning large tables is needed. At least in the UI. We return the first page, then second page (skipping the first) and so on. End users end pagination after a few pages => they must refine the query.
For backend processes we just use a ResultSet, but there's no need to paginate, obviously.
... with 100 000 000 records, the query is never finished. The DBMS just kills it. Why? Probably, because it led to the attempt to load the whole table into RAM. Before returning data to the client.
But you don't mean that the database would not return data to the client application before it had finished reading the data, do you? You might like to clarify exactly which client you mean, here.
Also, mandatory kudos to @MarkusWinand for use-the-index-luke.com/no-offset
Definitely, the whole Use The Index Luke is great.
Depending on whether or not your ux SLAs will permit it, you can pre-gen all your pagination ranges and store them in a cache. If your ux can handle a 10s delay from post to publish, that gives a 10000ms temporal window in which a large portion of a pagination query can be cached. Cache each size of list display. Ex: Show: 100, 250, 500 items.
Cool?
It's a very similar approach to the "Search After" feature of Elasticsearch (elastic.co/guide/en/elasticsearch/...), which has exactly the same problem with the "classical" offset/limit pagination.
By the way, using this method with MySQL too, for a WebAPI used to download all entities of a DB.
Thanks for documenting it!