
Why You Shouldn't Use OFFSET and LIMIT For Your Pagination

Ivo Pereira · Originally published at ivopereira.net · 5 min read



Gone are the days when we didn’t need to worry about database performance optimization.

With every new entrepreneur setting out to build the next Facebook, and with the mindset of collecting every possible data point to feed better Machine Learning predictions, we, as developers, need to prepare our APIs better than ever, providing reliable and efficient endpoints that can navigate through huge amounts of data without breaking a sweat.

If you have been doing backend or database architecture for a while you have probably already done paging queries, like this:

SELECT * FROM table_name LIMIT 10 OFFSET 40

Right?
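In application code, the page number shown in the UI is usually translated into that OFFSET. As a rough Python sketch (the `page_to_offset` helper is hypothetical, not from any library):

```python
# Sketch of how page-numbered UIs typically derive OFFSET: page n with
# `limit` rows per page skips (n - 1) * limit rows.
def page_to_offset(page: int, limit: int = 10) -> int:
    return (page - 1) * limit

# Page 5 with 10 rows per page reproduces the query above (OFFSET 40).
query = f"SELECT * FROM table_name LIMIT 10 OFFSET {page_to_offset(5)}"
```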

But if you have been building your paginations like this, I am sorry to say that you have been doing it wrong.

You don’t agree with me? You don’t have to take my word for it: Slack, Shopify and Mixmax paginate their APIs with the very concept we will be talking about today.

I challenge you to name a single backend developer who hasn’t ever had to deal with OFFSET and LIMIT for pagination purposes. For pagination in MVPs and low-data listings it “just works”.

But when you want to build reliable and efficient systems from scratch, you might as well do it right up front.

Today we will discuss the problems with this widely (and wrongly) used implementation, and how to achieve performant pagination.

What is wrong with OFFSET and LIMIT?

As we briefly explored in the past paragraphs, OFFSET and LIMIT work great for projects with low to no data usage.

The issue arises when your database starts gathering more data than your server can fit in memory, and you still need to paginate through all of it performantly.

For that to happen, the database needs to perform an inefficient Full Table Scan every time you request a page (insertions and deletions may happen in the meantime, and we don’t want outdated data!).

What is a Full Table Scan? A Full Table Scan (aka Sequential Scan) is a scan in which every row in a table is read sequentially and the columns encountered are checked for the validity of a condition. This type of scan is known to be the slowest due to the heavy amount of I/O reads from disk, consisting of multiple seeks as well as costly disk-to-memory transfers.

That means that if you have 100,000,000 users and you request an OFFSET of 50,000,000, the database will need to fetch all of those records (which will not even be needed!), put them in memory, and only then get the 20 results specified in the LIMIT.

So, to show a pagination like this in a website:

50,000 to 50,020 of 100,000

It would need to fetch 50,000 rows first. See how inefficient this is?

If you don’t believe me, take a look at this fiddle I’ve created. The left panel has a base schema that inserts 100,000 rows for our test, and on the right there are the problematic query and our solution. Just click Run at the top and compare the execution time of each. The first (problematic) query takes at least 30x as long as the second to run.

And it gets even worse with more data. Check out my Proof of Concept with 10M rows.

Now this should give you some knowledge on what happens behind the scenes. If you like what you are reading, subscribe here to get more content like this.

TL;DR: the higher your OFFSET, the longer the query will take.

What You Should Use Instead

This is what you should use:

SELECT * FROM table_name WHERE id > 10 LIMIT 20

This is cursor-based pagination.

Instead of storing the current OFFSET and LIMIT locally and passing them with each request, you should store the last received primary key (usually an ID) and the LIMIT, so the query ends up similar to the one above.

Why? Because by explicitly passing the latest row read, you are telling your DB exactly where to start searching, based on an efficient indexed key, and it won’t have to consider any rows outside that range.
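To make this concrete, here is a minimal sketch of cursor pagination using Python’s built-in sqlite3 module; the `users` table, column names, and `fetch_page` helper are illustrative, not from the article:

```python
import sqlite3

# Toy table: 100 users with an auto-incremented integer primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [(f"user{i}",) for i in range(1, 101)])

def fetch_page(last_id, limit=20):
    """Return the next page of rows after the cursor (last seen id)."""
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    ).fetchall()
    # The new cursor is the id of the last row, or None when we are done.
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor

page, cursor = fetch_page(0)         # first page: ids 1..20
page2, cursor2 = fetch_page(cursor)  # second page: ids 21..40
```

Each response hands the client an opaque cursor (here, just the last id) to send back for the next page, instead of a page number.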

Take the following comparison as an example:

mysql> SELECT * FROM table_name LIMIT 10 OFFSET 8000001;
[...]
10 rows in set (12.80 sec)

Against our optimized version:

mysql> SELECT * FROM table_name WHERE id > 8000000 LIMIT 10;
[...]
10 rows in set (0.01 sec)

Exactly the same records were returned, but the first query took 12.80 sec and the second one took 0.01 sec. Can you see the difference?
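If you want to reproduce a comparison like this locally without a MySQL setup, here is a small sketch using Python’s sqlite3; the schema and row count are made up for the demo, and absolute timings will not match the MySQL numbers above:

```python
import sqlite3
import time

# Toy table with 200,000 rows and an integer primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)",
                 (("x",) for _ in range(200_000)))

def timed(query, params):
    start = time.perf_counter()
    rows = conn.execute(query, params).fetchall()
    return rows, time.perf_counter() - start

# The same page (rows 190,001..190,010), fetched both ways.
offset_rows, offset_t = timed(
    "SELECT id FROM t ORDER BY id LIMIT 10 OFFSET ?", (190_000,))
keyset_rows, keyset_t = timed(
    "SELECT id FROM t WHERE id > ? ORDER BY id LIMIT 10", (190_000,))

# Both queries return exactly the same page, but the OFFSET variant
# had to step past 190,000 rows to get there.
assert offset_rows == keyset_rows
print(f"offset: {offset_t:.4f}s  keyset: {keyset_t:.4f}s")
```

The gap grows with the offset; on engines and datasets like the MySQL example above it reaches orders of magnitude.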

Caveats

For cursor pagination to work seamlessly, you will need a unique, sequential column (or columns), such as a unique integer ID, and this can be a deal-breaker in some specific cases.

As always, my advice is to think about the pros and cons of each table architecture and the kinds of queries you will need to perform on each. If you need to deal with a lot of related data in your queries, the Lists article by Rick James might provide deeper guidance.

If the issue at hand is not having a primary key, as with a many-to-many relationship table, the traditional OFFSET/LIMIT approach is always available for those cases; however, that reintroduces potentially slower queries. So I would recommend using an auto-incremented primary key in tables you want paginated, even if only for the sake of pagination.
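Another commonly used workaround, when the sort column is not unique on its own, is a composite cursor: remember both the sort value and the id of the last row, and resume strictly after that pair. A sketch with sqlite3 (the `events` table and `fetch_after` helper are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO events (id, created_at) VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-01"),
     (3, "2024-01-02"), (4, "2024-01-02")],
)

def fetch_after(last_created, last_id, limit=2):
    # Expanded form of the row-value comparison (created_at, id) > (?, ?);
    # this spelling also works on engines without row-value support.
    return conn.execute(
        """SELECT id, created_at FROM events
           WHERE created_at > ? OR (created_at = ? AND id > ?)
           ORDER BY created_at, id LIMIT ?""",
        (last_created, last_created, last_id, limit),
    ).fetchall()

first = fetch_after("", 0)                         # ids 1 and 2
second = fetch_after(first[-1][1], first[-1][0])   # resumes after id 2
```

For this to stay fast, the engine needs a composite index covering (created_at, id) so the WHERE clause is an index seek rather than a scan.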

Conclusion

The main takeaway should be to always check how your queries perform, whether with 1k rows or with 1M. Scalability is extremely important, and if it is handled correctly from the beginning, it can surely avoid many headaches in the future.

Oh, and please don’t forget to learn about indexes. And EXPLAIN your queries.

If you liked this post, subscribe here to get more content like this or check my blog.

Discussion

Aman Vishnani

This won't work if I want to sort the rows by a field name (attribute) which is not an ID / Primary Key.

SELECT id, First_Name
from users
WHERE ID > ? -- <- this will fail if I don't sort by ID
ORDER by First_name asc
LIMIT 2;

Sergiy Yevtushenko

The algorithm remains the same, except that instead of the ID we remember the key by which the sorting was performed, taken from the last record.

leob

But then you also need to have an index on that key (column), otherwise it's back to a full table scan again.

But, the whole point of the blog post is a good one - signalling the fact that a table scan is being performed when using OFFSET.

I wasn't aware of that, probably I was unconsciously assuming (like everyone else) that using OFFSET and LIMIT is somehow magically "performant". And (apparently) that's not the case.

Sergiy Yevtushenko

In the general case, optimization of the query is based on indexes (and a lot of other information) anyway. The approach described in the post relies on the primary key index, but when the ordering does not match the primary key ordering, an appropriate index is necessary.

Sam Ferree

This is a neat trick when IDs are incrementing integers, and maybe that's not a sane thing to do in some cases. But ahem:

“insertions and deletes may happen meanwhile and we don’t want outdated data!”

(Laughs in eventually consistent, highly indexed query models)

Consistency is an illusion. Strong Consistency doubly so.

Ivo Pereira (Author)

Excellent point on consistency and that quote @sam_ferree !

The point I have made is referring to the benefits in an exact comparison of this cursor-based approach with the traditional SQL approach to this situation.

Would you suggest anything different here?

Sam Ferree

Nope paging by ID is pretty nifty in this use-case

Ben Sinclair

Is it lunch time already?

Saurabh Sharma

IMO offset-based paginations have weird consistency problems:

  • what if someone adds a new record? Then the offset is shifted by 1 (if sorted by newest), and you will get one record twice.
Ben Sinclair

People are used to that in pagination. Go to eBay, sort by "ending soonest" and page through. As you go, items will disappear off the top of the results (because they... ended...) and flipping back through pages gets messy. I don't think that's unacceptable.
The same with sharing a link to page 1234 of some results - you know that by the time the recipient opens it, it'll probably be different, so if it's necessary, a "share results page" function could be built to handle it.

Saurabh Sharma

It would be better if they didn't have to.

Tolga Bilbey

We have developed a Django pagination app addressing this performance issue. It uses primary keys and stores the values in a cache; a cache timeout can be set to invalidate the cache. You can check this URL: github.com/bilbeyt/django-fast-pag...

Abdisalan

What a coincidence, I literally published the same article today!

I like your presentation more though :)

Hugh Jeremy

Why would the database engine perform a full table scan on a limit/offset where the limit/offset factor is indexed? Can you show the query planner output for a query that actually exhibits this behaviour? Perhaps this behaviour is specific to the database engine you are using? I run massive limit/offset pagination queries in Postgres, where the limit/offset columns are indexed, and the performance is fantastic. The query planner is adamant that no table scans are occurring.

Hugh Jeremy

Follow up thought - Are you perhaps doing something funky like limiting/offsetting inside or after a common table expression? Maybe the query planner is dropping the ball at a CTE boundary? That's the most common way I end up blowing my leg off, performance wise... 😭😂

George

Because most popular DB engines don't do a full table scan; they already have optimizations for LIMIT/OFFSET.

This is a typical low quality article, not researched well enough and it was written for the sake of content...

Mateusz Jarzyna

How to implement the pagination when ID is a UUID?

Sergiy Yevtushenko

There should be no difference. The only problem you may run into is that UUIDs do not provide any consistent ordering, so the output will look rather random. If this is a problem, you might consider an alternative unique ID, for example a ULID.

Ivo Pereira (Author)

Hi @mateusz ,

I would put that in the same category as the caveat I've described in the post. Having an auto-incremented ID as a key will give you a much more efficient solution than trying to wrap the strategy around a UUID.

It would be possible to work around it using a created_at column in combination with a UUID, but I would highly advise against that: given the performance hit you would take, you would be much better off sorting by ID.

Kacper Perzankowski

I thought the same. In my opinion, incremental PKs are rarely used in mature systems.

Michiel Hendriks

What if you do this?

SELECT * FROM table_name ORDER BY id LIMIT 10 OFFSET 8000001;

With PostgreSQL it will use the index to get your offset, no table scan needed.

It works for any b-tree based indexed column. So if you had an index on a column called first_name, you can do:

SELECT * FROM table_name ORDER BY first_name LIMIT 10 OFFSET 8000001;
zakwillis

Hi there thanks for the post.

Strangely, many developers have never had to deal with index based pagination. I have, but many haven't.

The main challenge with this approach is that a primary key/incrementing column will almost never be continuous after any kind of filtering, normally scope-related. A good example would be an online trading account: a user wants to see their trades.
So you may end up having to use an analytic function (ROW_NUMBER() OVER) to create the per-user trade ID, which requires the full result set to be available — which is how OFFSET works anyway. Alternatively, you would need a scope-based, guaranteed continuously incremented value, which becomes a design-time choice.
Where this works well is if you store a daily table of data which all users will access with different pagination. Then this is powerful.

Good post, and kudos to the other guy who wrote something similar.

leob

Well, you CAN use it, but only if your table is small. :-)

But apart from that you're right, great post.

Ryan Dsouza

Amazing post! I use this specifically for infinite scrolling type paginations.

Patico

Great article, thank you! But does it matter for NoSQL databases like Elasticsearch or MongoDB?

Abdisalan

Probably not, I’d think the query mechanism is much different