Chan Ro

Posted on Sep 8, 2023

Elasticsearch bulk reindex strategy

#elasticsearch

What is reindexing?

Elasticsearch reindexing means rebuilding an index from the data already stored in Elasticsearch, so that the rebuilt index can replace the old one.

Now, the question is: why do we need to reindex?

Elasticsearch is a NoSQL database, but it is not schemaless: every property of a document needs a data type (for strings, Elasticsearch's dynamic mapping defaults to text with a keyword sub-field). Because of that, there are restrictions when we need to change the data type of an existing property.

Example case

For a simple example, let's say we have a schema:

{
  "properties" : {
    "phoneNumber" : {
      "type" : "keyword"
    }
  }
}

Now, during development, the team realised that the phoneNumber property needs to be numeric to support numeric sorting. So the team changed the schema to use a numeric type such as long:

{
  "properties" : {
    "phoneNumber" : {
      "type" : "long"
    }
  }
}

Now, when the team tries to update the schema/mappings, Elasticsearch throws an error: the type of an existing field cannot be changed in place, because the records already in the index were indexed as keyword. The only way to resolve this is to run a reindex.

Reindexing a small database

Reindexing a small database is pretty simple. You can just run:

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}
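One detail the request above leaves implicit: if the destination index does not exist yet, Elasticsearch creates it with dynamically inferred mappings, so the new schema should be in place before the reindex starts. The sketch below only builds the two request bodies in Python and prints them. The helper names are made up for illustration, the long type is an assumption about which numeric type the team wants, and actually sending the requests (with an HTTP client or the official client library) is left out.

```python
import json


def create_index_request(index, properties):
    """Build a (method, path, body) triple that creates an index with explicit mappings."""
    return ("PUT", f"/{index}", {"mappings": {"properties": properties}})


def reindex_request(source, dest):
    """Build a (method, path, body) triple that copies every document from source to dest."""
    return ("POST", "/_reindex", {"source": {"index": source}, "dest": {"index": dest}})


# Hypothetical order of operations: create the destination first, then copy.
requests_to_send = [
    create_index_request("my-new-index-000001", {"phoneNumber": {"type": "long"}}),
    reindex_request("my-index-000001", "my-new-index-000001"),
]

for method, path, body in requests_to_send:
    print(method, path)
    print(json.dumps(body, indent=2))
```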

Reindexing a large database

There is a problem with reindexing a large database. Reindexing has to copy every record into the new index, so the process is not fast, and on a large database it takes quite a while to complete. (I am talking about hours or maybe days.)
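For a run that long, it usually makes sense to start the reindex as a background task and poll it rather than hold an HTTP connection open. A minimal sketch, assuming the wait_for_completion=false query parameter and the task API (GET _tasks/<task_id>): the helper names, the task id, and the sample response below are made up for illustration.

```python
def async_reindex_request(source, dest):
    """Start the reindex as a background task; the response then carries a task id."""
    body = {"source": {"index": source}, "dest": {"index": dest}}
    return ("POST", "/_reindex?wait_for_completion=false", body)


def task_status_request(task_id):
    """Ask the task API how far a running reindex has progressed."""
    return ("GET", f"/_tasks/{task_id}", None)


def progress(task_response):
    """Estimate progress as (created + updated + deleted) / total."""
    status = task_response["task"]["status"]
    done = status["created"] + status["updated"] + status["deleted"]
    total = status["total"]
    return done / total if total else 0.0


# Made-up task response, trimmed to the fields used above.
sample_response = {
    "completed": False,
    "task": {"status": {"total": 1000, "created": 250, "updated": 0, "deleted": 0}},
}

print(async_reindex_request("my-index-000001", "my-new-index-000001"))
print(task_status_request("node-id:12345"))  # hypothetical task id
print(f"{progress(sample_response):.0%}")
```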

Users can still read data from the old index while the reindex is running. But what if users add more records while the reindex is in progress? Reindex copies from a snapshot of the source index taken when it starts, so we may lose these new records.

In reality, we need to let users keep accessing the data while the reindex is happening (dynamic reindexing), because we cannot make users wait for hours or days (unless you have a good excuse for your clients and can put the application under maintenance).

Alias

The key to the solution is the alias. An alias is a secondary name that we can set on an index. The same alias can be set on multiple indices, and we can then access all of those indices through the alias name.
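As a rough illustration, aliases are managed through the _aliases endpoint, which takes a list of actions and applies them atomically. The sketch below builds such a body in Python; the helper name, alias name, and index names are made up.

```python
import json


def add_alias_actions(alias, indices):
    """Build the body for POST /_aliases that points one alias at several indices."""
    return {"actions": [{"add": {"index": index, "alias": alias}} for index in indices]}


# Hypothetical names: one alias covering two indices.
body = add_alias_actions("my-alias", ["my-index-000001", "my-new-index-000001"])
print("POST /_aliases")
print(json.dumps(body, indent=2))
```

A search sent to my-alias (for example GET /my-alias/_search) then runs against every index the alias points to.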

Approach 1.

A simple approach is to set the alias on both the old and the new index, so that we can read data from both indices and none of the records added while the reindex runs in the background is missed.

However, the problem is that reads will show duplicated results, because the reindex is copying data from the old index into the new one.
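To see why the duplicates appear, here is a toy in-memory model, not Elasticsearch itself: each index is a plain dict keyed by document id, and a search over the alias simply concatenates the hits from every index behind it. That matches the behaviour that matters here, since a search across several indices does not merge documents that share an id.

```python
# Toy model: an index is a dict of id -> document, an alias is a list of indices.
old_index = {
    "1": {"phoneNumber": "0400000001"},
    "2": {"phoneNumber": "0400000002"},
    "3": {"phoneNumber": "0400000003"},
}
new_index = {}
alias = [old_index, new_index]


def search_all(indices):
    """Return every (id, document) pair from every index behind the alias."""
    return [(doc_id, doc) for index in indices for doc_id, doc in index.items()]


# The reindex is part-way through: documents 1 and 2 have been copied so far.
for doc_id in ["1", "2"]:
    new_index[doc_id] = old_index[doc_id]

ids = [doc_id for doc_id, _ in search_all(alias)]
print(ids)  # ['1', '2', '3', '1', '2']
```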

Approach 2.


In this approach, I split the data into two indices, a reader index and a writer index, and used two aliases, reader and writer. The reader index has only the reader alias, while the writer index has both aliases (reader and writer). For reindexing, we create two more indices: a new writer index and a new reader index.

Then I removed the writer alias from the old writer index, so that users can no longer add new records to it, and added the writer alias to the new writer index, which is now the index for new records. I ran the reindex from the old reader and old writer indices into the new reader index.

After the reindex finishes, I moved the reader alias to the new reader index and then safely removed the old indices.

This way, users can read data while the reindex runs, and new records can be added without worrying about the reindex. (This leaves aside the question of available data nodes, which is a different story.)
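The sequence for Approach 2 can be sketched as an ordered list of requests. This is one reading of the steps rather than the exact commands I ran. All index and alias names are made up, and the new indices are assumed to exist already with the new mappings. The sketch also adds the reader alias to the new writer index so that fresh records are searchable straight away, which is an assumption. Step 3 must only be sent once the reindex task from step 2 has completed.

```python
def approach2_plan(old_reader, old_writer, new_reader, new_writer,
                   read_alias="items-read", write_alias="items-write"):
    """Return the ordered (method, path, body) requests for the reader/writer swap."""
    # Step 1: redirect writes in one atomic alias update.
    swap_writer = {"actions": [
        {"remove": {"index": old_writer, "alias": write_alias}},
        {"add": {"index": new_writer, "alias": write_alias}},
        {"add": {"index": new_writer, "alias": read_alias}},  # assumption, see above
    ]}
    # Step 2: copy both old indices into the new reader index.
    reindex = {
        "source": {"index": [old_reader, old_writer]},
        "dest": {"index": new_reader},
    }
    # Step 3: after the reindex has finished, move reads in one atomic update.
    move_reader = {"actions": [
        {"remove": {"index": old_reader, "alias": read_alias}},
        {"remove": {"index": old_writer, "alias": read_alias}},
        {"add": {"index": new_reader, "alias": read_alias}},
    ]}
    return [
        ("POST", "/_aliases", swap_writer),
        ("POST", "/_reindex?wait_for_completion=false", reindex),
        ("POST", "/_aliases", move_reader),
        # Step 4: the old indices no longer carry any alias and can be deleted.
        ("DELETE", f"/{old_reader}", None),
        ("DELETE", f"/{old_writer}", None),
    ]


plan = approach2_plan("items-reader-v1", "items-writer-v1",
                      "items-reader-v2", "items-writer-v2")
for method, path, _ in plan:
    print(method, path)
```

Because the reader alias never covers the old indices and the new reader index at the same time, reads do not see the duplicates from Approach 1.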
