Tariq Abughofa

Originally published at rabbitoncode.com

3 approaches to scroll through data in Elasticsearch

Elasticsearch is a search engine that provides full-text search capabilities. It stores data as documents in collections called indices. In this article I go through the techniques Elasticsearch supports for paginating through an index.

from / size

Pagination of results can be done by using the from and size parameters. The from parameter defines the number of items to skip from the start of the result set. The size parameter is the maximum number of hits to return.

GET users/_search
{
    "from" : 0, "size" : 100,
    "query" : {
        "term" : { "user" : "john" }
    }
}

You can filter using this method. You can also sort by adding this JSON at the root level of the previous request body:

"sort": [
  {"date": "asc"},
]

In Elasticsearch, you can't paginate beyond the max_result_window index setting, which is 10,000 by default. This means that from + size must not exceed that value. In practice, max_result_window is not a limitation but a safeguard against deep pagination, which can crash the server since this method requires loading all the previous pages as well.
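If you really do need to page past that default, the ceiling can be raised per index through the index settings API. Here is a minimal sketch using the Ruby client; the index name and the new limit are just examples:

require 'elasticsearch'

client = Elasticsearch::Client.new
# Raise the ceiling for the users index. Every page still forces
# Elasticsearch to load all the hits that precede it.
client.indices.put_settings(index: 'users', body: {
  index: { max_result_window: 20_000 }
})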

The scroll API

The scroll API is the recommended solution for efficient deep pagination and is required once you reach the max_result_window limit. It can be used to retrieve large numbers of results. It resembles cursors in SQL databases in that the server keeps track of how far the pagination has reached. In the same manner, it's not designed to serve user requests but rather to process large amounts of data.

In order to start scrolling, an initial request has to be sent to open a search context on the server. The request also specifies how long the context should stay alive with the scroll=TTL query parameter. This request keeps the context alive for 1 minute:

POST users/_search?scroll=1m
{
    "size": 100,
    "query" : {
        "term" : { "user" : "john" }
    }
}

The response of this request returns a scroll_id value to be used in the subsequent fetch requests.

After this request, the client can start scrolling through the data. To retrieve the next page of results, the same scroll request is sent each time:

POST _search/scroll
{
    "scroll" : "1m",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}

As you can see, the request has to specify the scroll_id (which the client gets from the previous response) and the scroll parameter, which tells the server to keep the context alive for another minute.
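Putting the two requests together, a complete scroll walk with the Ruby client could look like the sketch below (same users index and query as above; error handling omitted). Clearing the scroll at the end releases the search context right away instead of waiting for the TTL to expire:

require 'elasticsearch'

client = Elasticsearch::Client.new
# Initial request: opens the search context and returns the first batch.
response = client.search(index: 'users', scroll: '1m',
                         body: { size: 100, query: { term: { user: 'john' } } })

while (hits = response['hits']['hits']).any?
  # ... process the current batch of hits ...
  response = client.scroll(body: { scroll: '1m', scroll_id: response['_scroll_id'] })
end

# Release the search context as soon as the walk is finished.
client.clear_scroll(body: { scroll_id: response['_scroll_id'] })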

The search_after parameter

The scroll API is great for deep pagination, but scroll contexts are costly to keep alive and are not recommended for real-time user requests. As a substitute for scroll contexts in these situations, the search_after parameter was introduced to the search API. It lets the client pass information about the previous page that helps retrieve the current one, which means the search query has to impose a definite sort order on the results. Let's assume the first page was retrieved with the following query:

GET users/_search
{
    "size": 10,
    "query" : {
        "term" : { "user" : "john" }
    },
    "sort": [
        {"date": "asc"}
    ]
}

For subsequent pages, we take the sort values of the last document returned by this request and pass them with the search_after parameter. A later request would look something like this:

GET users/_search
{
    "size": 10,
    "query" : {
        "term" : { "user" : "john" }
    },
    "sort": [
        {"date": "asc"}
    ],
    "search_after": [1463538857]
}

The from parameter can't be used when search_after is passed, as the two contradict each other. This solution is very similar to the scroll API, but it relieves the server from keeping the pagination state, which also means it always returns the latest version of the data. For this reason, the sort order may change during a walk if updates or deletes happen on the index.

This solution has the clear disadvantage that you can't get a page at random: to fetch page 100 you have to fetch pages 0 through 99 first. Still, it works well for user pagination when users can only move to the next or previous page.
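As a sketch, a next-page walk with the Ruby client could carry the sort values of the last hit forward like this (same users index, query and sort as in the JSON examples above):

require 'elasticsearch'

client = Elasticsearch::Client.new
body = {
  size: 10,
  query: { term: { user: 'john' } },
  sort: [{ date: 'asc' }]
}

page = client.search(index: 'users', body: body)
hits = page['hits']['hits']
until hits.empty?
  # ... render the current page ...
  # The sort values of the last hit become the next request's search_after.
  body[:search_after] = hits.last['sort']
  page = client.search(index: 'users', body: body)
  hits = page['hits']['hits']
end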

Random access with search_after

As explained before, the search_after parameter doesn't allow random-access pagination. However, there is a way to get random access by keeping statistical data about the index in Elasticsearch. The approach is inspired by histograms in the Postgres database, which store statistics about column value distribution as a list of bucket boundaries. The idea is to implement that manually in Elasticsearch: create an index whose documents have the following schema:

{
    "bucket_id": 100,
    "starts_after": 102181
}
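Let's call this index pagination_index. As a sketch, it could be created with an explicit mapping like this using the Ruby client (the field types are assumptions):

require 'elasticsearch'

client = Elasticsearch::Client.new
client.indices.create(index: 'pagination_index', body: {
  mappings: {
    properties: {
      bucket_id:    { type: 'integer' },
      starts_after: { type: 'long' }
    }
  }
})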

Before filling pagination_index we should decide on a bucket size. Let's say it's 1000 documents. The next step is to fill the index using the search API with the search_after parameter. Assuming the data index is called articles, the operation would look like this:

require 'elasticsearch'

client = Elasticsearch::Client.new
max_id = 0
bucket_id = 0
loop do
  # Record where this bucket starts.
  client.index(index: 'pagination_index', body: { bucket_id: bucket_id, starts_after: max_id })
  # Fetch the next 1000 documents, ordered by an assumed numeric, sortable id field.
  page = client.search(index: 'articles',
                       body: { size: 1000, sort: [{ id: 'asc' }], search_after: [max_id] })
  hits = page['hits']['hits']
  break if hits.empty?
  max_id = hits.last['sort'].first
  bucket_id += 1
end

Now, to paginate in order (get the next page), search_after takes its value from the previous page, exactly as with regular search_after pagination. When a random page is needed, we query pagination_index for the bucket's starts_after value and use it to fetch the required page. It would look like this:

client = Elasticsearch::Client.new
page_size = 100
bucket_size = 1000

# Get page 0.
page = client.search(index: 'articles',
                     body: { size: page_size, sort: [{ id: 'asc' }] })
# ... do some processing or rendering of the results ...
max_id = page['hits']['hits'].last['sort'].first

# Get page 1 by continuing after the last document of page 0.
page = client.search(index: 'articles',
                     body: { size: page_size, sort: [{ id: 'asc' }], search_after: [max_id] })
# ... do some processing or rendering of the results ...

# Get page 200 at random: find the bucket it falls into ...
bucket_id = 200 * page_size / bucket_size
page_info = client.search(index: 'pagination_index',
                          body: { query: { term: { bucket_id: bucket_id } } })
starts_after = page_info['hits']['hits'].first['_source']['starts_after']
# ... then skip the remaining offset inside the bucket (assuming sequential ids).
after = starts_after + (200 * page_size % bucket_size)
page = client.search(index: 'articles',
                     body: { size: page_size, sort: [{ id: 'asc' }], search_after: [after] })

This approach works for any query, including whatever filtering is needed, but a pagination_index built this way only works for that specific query. The pagination_index also has to be maintained regularly; until it is updated, the page boundaries will be approximate. It is still a good approach for showing real-time results that require deep random-access pagination.
