ChunTing Wu

Posted on Dec 19, 2022

Explaining Pagination in ElasticSearch

#opensource

Pagination is a common technique for presenting data on web pages. When a query returns a lot of data, it is usually a good idea to show only part of it at a time while keeping the option to continue browsing, both to reduce the load on the backend and to improve the user experience.

Here's a quick look at the most common appearance.

< [1] | 2 | 3 | 4 | ... >

The user can see the current page number, choose the previous or next page, or even jump directly to a specific page. There are three common approaches to pagination:

Offset Pagination
Keyset Pagination
Cursor-based Pagination

Since this article is mainly about pagination in ElasticSearch, I will only describe these three briefly.

Offset Pagination

This is the most common pagination pattern, and let's represent it in SQL.

SELECT *
FROM table_name
LIMIT 10 OFFSET 20;

Suppose there are 10 records per page. This command then returns the third page: LIMIT limits the number of records per page, and OFFSET skips ahead to the specified page.

This kind of pagination has clear advantages.

It is very easy to implement, as well as very intuitive.
It is possible to jump to a random page.

But the disadvantages are also obvious.

Performance is a problem.

Why is there a performance problem?

In the above example, it looks like we only need to take 10 records from the database, but in fact the database takes 30 records and discards the first 20. When OFFSET is very large, the database still has to read and throw away every skipped record, which creates a lot of overhead.
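To make the page arithmetic concrete, here is a minimal Python sketch using the standard sqlite3 module. The articles table, its columns, and the fetch_page_by_offset helper are made up for illustration, and an ORDER BY is added so that the page contents are deterministic.

```python
import sqlite3


def make_demo_db(row_count=35):
    """Create an in-memory table with ids 1..row_count (hypothetical demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO articles (id, title) VALUES (?, ?)",
        [(i, f"article {i}") for i in range(1, row_count + 1)],
    )
    return conn


def fetch_page_by_offset(conn, page, page_size=10):
    """Return the rows of a 1-based page using LIMIT/OFFSET."""
    # Page 3 with 10 rows per page skips (3 - 1) * 10 = 20 rows.
    offset = (page - 1) * page_size
    cur = conn.execute(
        "SELECT id, title FROM articles ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()


conn = make_demo_db()
third_page = fetch_page_by_offset(conn, page=3)
print([row[0] for row in third_page])  # ids 21 through 30
```

Any page number can be requested directly, which is the random-jump advantage; the price is that the rows before the offset still have to be walked over by the database.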

Keyset Pagination

In order to solve the performance problem, we have the keyset pagination approach.

We still use SQL as an example. Following the scenario above, we limit the number of records per page to 10 and take the third page.

SELECT *
FROM table_name
WHERE id > 20
ORDER BY id
LIMIT 10;

LIMIT still limits the number of records per page, as above, but we first sort with ORDER BY and then set the starting position with WHERE instead of OFFSET. Here 20 is the id of the last record on the second page, so we get the third page without taking 30 records.

There are several implementation details to pay attention to.

The sort must be fast. In a relational database, the columns being sorted must have indexes.
Where does the value in the WHERE clause come from? It can come from the frontend, which remembers the last position of the previous page, or from the backend through some storage mechanism.
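The following Python sketch shows one way the WHERE value can be carried from page to page, again with the standard sqlite3 module. The articles table and the fetch_page_after helper are hypothetical; the demo data deliberately has gaps in the ids to show that the next starting point comes from the last row actually returned, not from arithmetic on the page number.

```python
import sqlite3


def fetch_page_after(conn, last_seen_id, page_size=10):
    """Return (rows, next_key) for the page that follows last_seen_id."""
    cur = conn.execute(
        "SELECT id, title FROM articles WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, page_size),
    )
    rows = cur.fetchall()
    # The key for the next page is the id of the last row on this page.
    next_key = rows[-1][0] if rows else None
    return rows, next_key


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title) VALUES (?, ?)",
    # Skip multiples of 7 so the ids are not contiguous (30 rows remain).
    [(i, f"article {i}") for i in range(1, 36) if i % 7 != 0],
)

pages = []
last_id = 0  # assumes ids are positive, so 0 means "start from the beginning"
while True:
    rows, next_key = fetch_page_after(conn, last_id)
    if not rows:
        break
    pages.append([row[0] for row in rows])
    last_id = next_key

print(len(pages))    # 3
print(pages[1][-1])  # 23, the key used to request the third page
```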

Of course, this kind of pagination also has advantages and disadvantages. The advantages are as follows.

It works well when implemented correctly, e.g., on index columns.
It can run on very large data sets.

But these advantages come at a price.

It is difficult to implement.
There is no way to jump to random pages.

I believe the first drawback is easy to understand. It is not a very intuitive approach, and the value in the WHERE clause has to be produced by engineering effort.

The second drawback follows from the first. Jumping to a specific page requires the correct WHERE clause, but that clause depends on the result of the previous page, whether it is tracked by the frontend or by the backend. Therefore, users cannot jump between pages as they wish.

Cursor-based Pagination

Keyset pagination is actually a special case of cursor-based pagination, which can be seen as its more advanced, generalized version.

A cursor is an object defined by the engineering side to mark where the pagination starts. There are several common types of cursors.

Encoded cursor, such as base64. Suppose the cursor is eyJpZCI6IDIwfQ==; decoding it gives the JSON string {"id": 20}, so the specified id is 20.
Token cursor. The backend generates a token for each search result and stores it; the frontend sends the token back to request the next page, and the backend looks up where to start based on the token.

No matter which cursor is used, the backend still uses the keyset pagination mechanism for pagination.
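As an illustration of the encoded variant, here is a small Python sketch that round-trips the cursor from the example above. The encode_cursor and decode_cursor helpers are hypothetical names, and the JSON layout of the cursor is only an assumption about what an application might store.

```python
import base64
import json


def encode_cursor(position):
    """Serialize a position dict to JSON and wrap it in base64."""
    raw = json.dumps(position).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_cursor(cursor):
    """Reverse encode_cursor: base64 text back to a position dict."""
    raw = base64.b64decode(cursor.encode("ascii"))
    return json.loads(raw)


cursor = encode_cursor({"id": 20})
print(cursor)                 # eyJpZCI6IDIwfQ==
print(decode_cursor(cursor))  # {'id': 20}
```

Note that base64 is an encoding, not encryption: the client can read and modify the cursor, so anything sensitive stored in it would need to be signed or kept on the backend instead.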

Cursor-based pagination fully inherits the advantages and disadvantages of keyset pagination, with one extra advantage: a cursor can carry more information, such as a session timeout or user privileges.

How about ElasticSearch

The examples above are written in SQL, but ElasticSearch supports these methods as well.

Here is an example of offset pagination.

GET /index_name/_search
{
  "from": 20,
  "size": 10,
  "query": {
    "match_all": {}
  }
}

As in SQL, from specifies where to start and size limits the number of records per page. The same scalability limitations apply, and ElasticSearch additionally caps from plus size so that poorly performing queries cannot affect the health of the cluster.

max_result_window (the index.max_result_window index setting): the maximum value of from + size for a search; the default is 10000. If from + size > 10000, the request returns an error.
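A client can apply the same check before sending the request. The sketch below only builds the request body; offset_request is a hypothetical helper, and the constant mirrors the default of 10000 mentioned above rather than reading the real index setting.

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # assumed default; an index may override it


def offset_request(page, page_size=10, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """Build a from/size search body for a 1-based page, or fail early."""
    start = (page - 1) * page_size
    if start + page_size > max_result_window:
        raise ValueError(
            f"from + size = {start + page_size} exceeds the result window "
            f"of {max_result_window}"
        )
    return {"from": start, "size": page_size, "query": {"match_all": {}}}


print(offset_request(3))             # from is 20, size is 10
print(offset_request(1000)["from"])  # 9990, the last page that still fits
```

With 10 records per page, page 1000 is the deepest page offset pagination can reach under the default limit; page 1001 would be rejected.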

How to solve it? Use the keyword search_after for keyset pagination.

GET index_name/_search
{
  "size": 10,
  "sort": [
    {
      "id": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "match_all": {}
  },
  "search_after": [
    20
  ]
}

This is a typical implementation of keyset pagination on ElasticSearch: search_after takes the sort values of the last hit on the previous page.
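To page through a whole result set this way, the sort values of the last hit on each page become the search_after of the next request. The Python sketch below shows that loop. It does not talk to a real cluster: fake_search is a hypothetical in-memory stand-in that returns responses in the shape ElasticSearch uses (hits.hits, each hit carrying a sort array), and the helper names are made up for illustration.

```python
def build_search_after_body(page_size, after=None):
    """Build the request body; omit search_after on the first page."""
    body = {
        "size": page_size,
        "sort": [{"id": {"order": "asc"}}],
        "query": {"match_all": {}},
    }
    if after is not None:
        body["search_after"] = after
    return body


def iterate_pages(search, page_size=10):
    """Yield pages of hits until the search returns no more results."""
    after = None
    while True:
        response = search(build_search_after_body(page_size, after))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield hits
        # The sort values of the last hit mark where the next page starts.
        after = hits[-1]["sort"]


def fake_search(body):
    """Hypothetical stand-in for a search call over documents with ids 1..25."""
    after = body.get("search_after", [0])[0]
    matching = [i for i in range(1, 26) if i > after][: body["size"]]
    return {"hits": {"hits": [{"_source": {"id": i}, "sort": [i]} for i in matching]}}


pages = list(iterate_pages(fake_search))
print([len(page) for page in pages])  # [10, 10, 5]
```

In a real application the search argument would wrap whatever client is used to send the request; the sort field should be unique (or combined with a tiebreaker) so that no document is skipped or repeated between pages.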

So does ElasticSearch support cursor-based pagination? Yes. Unlike SQL, where you have to implement your own cursor in the application, ElasticSearch already has its own cursor mechanism, called the Scroll API.

The principle is to create a snapshot of the current query and then call scroll each time to move to the next page. A practical example looks like the following.

POST index_name/_search?scroll=1m
{
  "size": 10,
  "sort": [
    {
      "id": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "match_all": {}
  }
}

This will get a response with _scroll_id, and then we can jump to the next page just by using this _scroll_id.

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "OXOXQQ=="
}

Since this is cursor-based pagination, we cannot specify which page to jump to; we can only keep going to the next page until there are no more results.

Because the Scroll API keeps a snapshot of the current query result, we must choose the TTL carefully or delete the snapshot once we are done; otherwise it keeps occupying cluster resources such as hard drive space.

DELETE /_search/scroll
{
  "scroll_id" : "OXOXQQ=="
}
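Putting the three requests together, a scroll session is a loop with a guaranteed cleanup step. The Python sketch below shows that shape. The send callable and the FakeCluster class are hypothetical stand-ins so the example can run without a cluster; only the paths, the _scroll_id field, and the hits.hits layout follow the requests shown above.

```python
def scroll_all(send, index, page_size=10, keep_alive="1m"):
    """Yield pages of hits from a scroll session, then release it."""
    response = send(
        "POST",
        f"/{index}/_search?scroll={keep_alive}",
        {
            "size": page_size,
            "sort": [{"id": {"order": "asc"}}],
            "query": {"match_all": {}},
        },
    )
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield response["hits"]["hits"]
            response = send(
                "POST",
                "/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = response["_scroll_id"]
    finally:
        # Delete the snapshot even if the caller stops early or an error occurs.
        send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})


class FakeCluster:
    """Hypothetical in-memory stand-in that mimics the scroll responses."""

    def __init__(self, total):
        self.docs = list(range(1, total + 1))
        self.position = 0
        self.page_size = 10
        self.open_contexts = set()

    def send(self, method, path, body):
        if method == "DELETE":
            self.open_contexts.discard(body["scroll_id"])
            return {"succeeded": True}
        if path != "/_search/scroll":  # the initial search opens the context
            self.page_size = body["size"]
            self.position = 0
            self.open_contexts.add("fake-scroll-id")
        batch = self.docs[self.position:self.position + self.page_size]
        self.position += len(batch)
        return {
            "_scroll_id": "fake-scroll-id",
            "hits": {"hits": [{"_source": {"id": i}} for i in batch]},
        }


cluster = FakeCluster(total=25)
page_sizes = [len(page) for page in scroll_all(cluster.send, "index_name")]
print(page_sizes)             # [10, 10, 5]
print(cluster.open_contexts)  # set(), the snapshot was released
```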

In recent versions of ElasticSearch, the Scroll API is no longer recommended for deep pagination. The recommended replacement is search_after combined with a newer mechanism introduced in 7.10, PIT (Point In Time).

PIT works similarly to Scroll API, but is more flexible and better optimized for performance.

Scroll API takes a snapshot of a single query and can only jump pages on that snapshot, but PIT takes a snapshot of the current data set and can do anything after getting the snapshot, not just jump pages.

However, the details go beyond the basics of pagination, so I will not dive into PIT in this article.

Conclusion

In general, there are two major types of pagination: offset pagination and keyset pagination. Only keyset pagination scales to big data, but only offset pagination lets users jump to random pages, so choosing between them is a trade-off.

In fact, even in a big data scenario, if the feature requirements strictly define the maximum number of pages, offset pagination will not noticeably hurt performance, because the offset stays bounded.

Therefore, choosing the implementation is not just a technical decision; more often it is a trade-off against the feature requirements. Nevertheless, engineers should recognize that a requirement for random page jumping over big data with no page limit is one to clearly reject.

DEV Community