<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aneesh Makala</title>
    <description>The latest articles on DEV Community by Aneesh Makala (@makalaaneesh).</description>
    <link>https://dev.to/makalaaneesh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F311801%2F0083ba79-2729-4884-848c-030c2e28d592.jpeg</url>
      <title>DEV Community: Aneesh Makala</title>
      <link>https://dev.to/makalaaneesh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/makalaaneesh"/>
    <language>en</language>
    <item>
      <title>Vectorization in OLAP Databases</title>
      <dc:creator>Aneesh Makala</dc:creator>
      <pubDate>Sun, 24 Apr 2022 13:50:35 +0000</pubDate>
      <link>https://dev.to/makalaaneesh/vectorization-in-olap-databases-100j</link>
      <guid>https://dev.to/makalaaneesh/vectorization-in-olap-databases-100j</guid>
      <description>&lt;p&gt;A recent evolution in OLAP (Online Analytical Processing) database systems has been to overhaul the SQL query engine from the traditional ones setup for OLTP (Online Transactional Processing) usecases. They have done this by either using vectorization, or just-in-time (JIT) compilation. In this article, I want to dive a little deep into why and how vectorization helps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Before we go into vectorization, it helps to understand how database query engines are typically implemented. Andy Pavlo does a brilliant job of explaining it &lt;a href="https://youtu.be/L5NhM7kw6Eg?list=PLSE8ODhjZXjbohkNBWQs_otTrBTrjyohi&amp;amp;t=300"&gt;here&lt;/a&gt;, but to summarize, they follow what is called a "pipeline" model. Each operator in the query tree calls next() on its child node, which in turn calls next() on its own child, and so on; each operator processes the tuple it receives and returns the result up to its parent. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8HnCwanL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ldrwa9uny0aoi6aov48p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8HnCwanL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ldrwa9uny0aoi6aov48p.png" alt="Image description" width="600" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://www.tutorialspoint.com/distributed_dbms/distributed_dbms_query_optimization_centralized_systems.htm"&gt;image&lt;/a&gt; above, for example, the SELECT operator for Dname=marketing (not the same as SELECT in SQL, mind you!) calls "next" on its child, the department table scan operator. That operator scans the table, perhaps sequentially or using an index, and returns a single row. The SELECT operator then applies the filter and, if there is a match, returns the tuple further up the tree. In this way, each tuple is fed from the leaf nodes up to the top of the tree in a "pipeline". It's worth noting that unlike the SELECT operator, which processes one tuple at a time, some operators, like a join, need all the rows from one of their children to be materialized (i.e. collected and saved in memory) before they can proceed. These are often referred to as "pipeline-breakers".&lt;/p&gt;
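&lt;p&gt;As a rough sketch, the pipeline model can be expressed like this (plain Python for illustration; the operator names are hypothetical, not from any real engine):&lt;/p&gt;

```python
# A minimal sketch of the pipeline model: every operator pulls one tuple
# at a time from its child via next(). Names here are illustrative only.
class ScanOperator:
    def __init__(self, rows):
        self.it = iter(rows)

    def next(self):
        return next(self.it, None)  # one tuple at a time; None when done

class SelectOperator:  # relational selection, i.e. a filter
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while True:
            row = self.child.next()
            if row is None:
                return None  # child is exhausted
            if self.predicate(row):
                return row  # pass a matching tuple up the pipeline

rows = [("Marketing", 10), ("Sales", 20), ("Marketing", 30)]
op = SelectOperator(ScanOperator(rows), lambda r: r[0] == "Marketing")
out = []
while (row := op.next()) is not None:
    out.append(row)
print(out)  # [('Marketing', 10), ('Marketing', 30)]
```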

&lt;h2&gt;
  
  
  Vectorization
&lt;/h2&gt;

&lt;p&gt;Vectorization is very similar to the pipeline model, with one important difference: each operator returns a batch of tuples instead of a single tuple. This technique is used in many popular OLAP engines, including Presto, Redshift, and Snowflake. At first it seems like a rather simple change, but it has significant performance implications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduced Invocations
&lt;/h3&gt;

&lt;p&gt;A typical OLAP query analyzes large amounts of data and produces aggregations like sums or averages. While we take function calls for granted, the time to handle an invocation for each row, for each operator, over billions of rows, adds up quickly. Batching means far fewer invocations, and therefore better performance overall. &lt;/p&gt;
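&lt;p&gt;Here's a toy illustration of that overhead (plain Python, not a database engine): summing through one function call per row versus one call per batch.&lt;/p&gt;

```python
import timeit

def next_one(x):
    return x  # stands in for a per-tuple next() call

def sum_per_tuple(data):
    total = 0
    for x in data:
        total += next_one(x)  # one invocation per row
    return total

def sum_per_batch(data, batch=1024):
    total = 0
    for i in range(0, len(data), batch):
        total += sum(data[i:i + batch])  # one invocation per batch
    return total

data = list(range(100_000))
assert sum_per_tuple(data) == sum_per_batch(data)
print("per-tuple:", timeit.timeit(lambda: sum_per_tuple(data), number=20))
print("per-batch:", timeit.timeit(lambda: sum_per_batch(data), number=20))
```

On CPython, the batched version is typically several times faster, purely from making fewer calls.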

&lt;h3&gt;
  
  
  SIMD
&lt;/h3&gt;

&lt;p&gt;Single Instruction Multiple Data (SIMD) is, well, the real reason why vectorization helps. Modern processors let you perform the same CPU instruction on multiple data elements with the help of 128- to 512-bit registers. So, if you want to add two integer vectors of size 1000 (like in the &lt;a href="http://ftp.cvut.cz/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html#:~:text=SIMD%20is%20short%20for%20Single,data%20is%20called%20scalar%20operations."&gt;image&lt;/a&gt; below), you'd need 1000 ADD instructions. With 8-lane SIMD, i.e. a 256-bit register holding 8 ints (each int = 32 bits), you'd need roughly 1000/8 = 125 instructions. That's a huge improvement! (theoretically, at least)&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kRBypSe1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g1f28j9p2s0f2iustuna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kRBypSe1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g1f28j9p2s0f2iustuna.png" alt="Image description" width="439" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In C++, you can use &lt;a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html"&gt;intrinsics&lt;/a&gt;, which provide a way to leverage these registers on Intel hardware. These functions can be hard to grok (very much in line with the usual readability trade-off in highly optimized code). But, as a small example, &lt;code&gt;_mm256_add_epi32(a, b)&lt;/code&gt; adds two 256-bit vectors a and b, each packing eight 32-bit integers, lane by lane! &lt;/p&gt;
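&lt;p&gt;To make the lane-wise semantics concrete, here's a plain-Python emulation of what &lt;code&gt;_mm256_add_epi32&lt;/code&gt; computes. The real intrinsic does all 8 lanes in a single instruction; this loop is only an illustration of the semantics, not actual SIMD.&lt;/p&gt;

```python
LANES = 8           # a 256-bit register holds 8 x 32-bit integers
MASK = 0xFFFFFFFF   # each 32-bit lane wraps around independently

def emulate_mm256_add_epi32(a, b):
    """One independent 32-bit add per lane, like _mm256_add_epi32."""
    assert len(a) == len(b) == LANES
    return [(x + y) & MASK for x, y in zip(a, b)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
print(emulate_mm256_add_epi32(a, b))  # [11, 22, 33, 44, 55, 66, 77, 88]
```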
&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Now, let's walk through an actual example. Say you want to compute the average of 100 million integers (because, why not?).&lt;/p&gt;

&lt;p&gt;The basic pipeline model involves two operators: a sequential scan operator that reads one int at a time, and an aggregation operator that adds up all the integers and then computes the average. Pretty simple. Here's some pseudo-code; the AggregationOperator has the SequentialScanOperator as its child, which it calls iteratively until completion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SequentialScanOperator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="c1"&gt;#read int from memory/disk
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AggregationOperator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;childOperator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
      &lt;span class="n"&gt;total_sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;
      &lt;span class="n"&gt;total_len&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_sum&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;total_len&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On an Intel Xeon CPU (2.40 GHz), this took 0.463 seconds for 100 million integers. &lt;/p&gt;

&lt;p&gt;Now, for the vectorized execution, the operators and the overall setup remain the same, except that each operator's next() method now returns a batch of tuples instead of a single tuple. The machine used for the experiment supported AVX2 intrinsics (essentially 256-bit registers), so we can pack 8 integers into a register. Therefore, let's say each operator now processes a batch of 8 tuples. The SequentialScanOperator will return 8 tuples, and the AggregationOperator will read 8 input tuples from its child at a time. Of course, since we want to aggregate everything down to a single value, the output of the AggregationOperator will be a batch of just 1 tuple. Here's some pseudo-code (with intrinsics):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SequentialScanOperator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;xv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;xv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="c1"&gt;#read int from memory/disk
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;xv&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AggregationOperator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# initialize a 256-bit vector with 0s
&lt;/span&gt;    &lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;aggVector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_setzero_si256&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;childOperator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="c1"&gt;# load inputs into a 256-bit temp vector
&lt;/span&gt;      &lt;span class="n"&gt;aggVectorTemp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_set_epi32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
         &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
         &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
         &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;input_tuple_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
      &lt;span class="c1"&gt;# vectorized add temp vector to result vector
&lt;/span&gt;      &lt;span class="n"&gt;aggVector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_add_epi32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggVector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggVectorTemp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;total_len&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;

    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="c1"&gt;# load it back to a (c++) array
&lt;/span&gt;    &lt;span class="n"&gt;_mm256_store_si256&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggVector&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;# final reduce of vector to a single int
&lt;/span&gt;    &lt;span class="n"&gt;total_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;total_sum&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;total_len&lt;/span&gt;&lt;span class="p"&gt;),]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the same machine, this code took 0.199 seconds. That's a 2.32x speedup! While that's great, a question immediately comes to mind: shouldn't we expect a speedup closer to 8, since we packed 8 integers together and performed a single instruction on them? The answer is: not necessarily. In this example, the program is memory-bound, so memory bandwidth becomes the bottleneck. With vectorization we've sped up the computation, but we still have to read the data from memory (in the example above) or from disk, and memory is typically much slower than the CPU. Therefore, we don't see the gains we might expect. This &lt;a href="https://stackoverflow.com/questions/18159455/why-vectorizing-the-loop-does-not-have-performance-improvement/18159503#18159503"&gt;SO answer&lt;/a&gt; explains it further!&lt;/p&gt;

&lt;p&gt;P.S. The actual code that ran this example can be found &lt;a href="https://github.com/makalaaneesh/db_vectorization"&gt;here&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>database</category>
      <category>simd</category>
      <category>olap</category>
    </item>
    <item>
      <title>The journey from Supercomputing to Spark</title>
      <dc:creator>Aneesh Makala</dc:creator>
      <pubDate>Sat, 29 Jan 2022 19:43:19 +0000</pubDate>
      <link>https://dev.to/makalaaneesh/the-journey-from-supercomputing-to-spark-4hdb</link>
      <guid>https://dev.to/makalaaneesh/the-journey-from-supercomputing-to-spark-4hdb</guid>
      <description>&lt;p&gt;In the world of computer science, there are so many tools and technologies out there, that it can get overwhelming rather quickly. Before you've completed learning how to use a tool, one of the big tech companies has open-sourced yet another cool project. The problem is: how do you keep up? One of the things that has helped me is to focus more on the "why" rather than the "how". It's impossible to know how to use every single tool out there, and so it's reasonable to learn how to use them when you need to, say on the job, or when you're working on a side project. But spending time understanding the "why", the motivation of the project goes a long way. it enables you to quickly understand the problems that it's solving, why other similar projects are built, how to compare and contrast them, and most importantly, how to pick the right tool for the job at hand. (Yes, I used to be a Python guy and build everything in python, but I've now realised that it's not about the language, it's about understanding the problem at hand, and finding the right tool that can solve the problem and fits well into your existing architecture 🥲 ).&lt;/p&gt;

&lt;p&gt;So, let's take a 10,000-ft view of the journey all the way from supercomputing to Spark, and see why some of these technologies were built, what their limitations were, and how they sparked the need for the next generation of tools.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Back in the day&lt;/strong&gt;, the only approach for scaling was vertical. If you had a computationally intensive task, you would beef up the machine. In other words, supercomputing. Of course, that works well, to an extent. But that has its limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To take maximum advantage of high-performance hardware, you have to write highly efficient code, which requires expertise and often sacrifices readability.&lt;/li&gt;
&lt;li&gt;Data sizes were becoming huge, which meant that even with high-performance hardware, a single machine couldn't finish within a reasonable amount of time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enter, cluster computing&lt;/strong&gt;. Lots of cheap, unreliable machines that together achieve reliable computing. Well, that seems reasonable! But of course, it's not paradise-land yet. With solutions to problems come more, and better, problems. Yes, you could now leverage a cluster, but that meant writing fault-tolerant code and debugging parallel programs, which requires a lot of low-level programming expertise. As a software engineer, maybe all you want to do is count the words that appear in a large set of documents, but now you have to worry about parallelizing it and making your code fault-tolerant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hello, map-reduce/hadoop.&lt;/strong&gt; The motivation here was to create a higher level of abstraction that hides all these complexities and lets the programmer concentrate on the problem at hand. If you could model your problem as a map and a reduce procedure, you could make use of the map-reduce infrastructure, which orchestrates the processing on distributed servers and manages all the communication between nodes in a reliable, fault-tolerant manner. In a distributed setting, network latency can become the bottleneck, so map-reduce addressed this with a distributed file system, the Hadoop Distributed File System (HDFS), wherein data is first partitioned across multiple nodes in a cluster and computation is moved to the data, rather than the other way around. In other words, each node operates only on the data present on its own disks, which avoids moving data around the cluster and thereby reduces latency. While its proponents thought it was a "paradigm shift", it had its critics, notably Michael Stonebraker (a specialist in database systems, and Turing Award winner). Map-reduce was attempting to solve the problem of processing big data at scale, so some of the criticisms were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why not use one of the then-existing high-performance parallel database systems instead of map-reduce? The database community had spent years optimizing and scaling data processing operations; map-reduce, in fact, paled in comparison on execution times. &lt;/li&gt;
&lt;li&gt;The lessons learnt in the database community, such as "schemas are good" or "high-level access languages like SQL are good", were not incorporated by map-reduce. &lt;/li&gt;
&lt;li&gt;It did not support indexes, or transactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these were all valid points, the map-reduce community had counter-arguments. First, map-reduce was not meant to be a DBMS; it was an algorithmic technique. Further, it was built to perform data processing on cheap, possibly unreliable computers in a reliable manner (as opposed to parallel RDBMS systems, which for the most part assumed reliable hardware). &lt;br&gt;
Setting these criticisms aside for a moment, map-reduce also had other limitations. First, it writes to disk at every step, which isn't very efficient. Second, it had a limited set of operators: map and reduce. And it was built around an acyclic data flow model, which was not conducive to the types of applications that were starting to be built, like machine learning algorithms and interactive data analysis. That is, you could not reuse a working set of data across multiple parallel operations. If you had to run multiple analytical queries on the same data using map-reduce, you would have to load it from disk every time, which again, isn't very efficient.&lt;/p&gt;
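&lt;p&gt;To make the programming model concrete, here's what the word-count task from earlier looks like as a map and a reduce. This is a single-process sketch (the real framework runs the map and reduce tasks across a cluster; the function names here are illustrative):&lt;/p&gt;

```python
from collections import defaultdict

def map_fn(document):
    # map: emit a (word, 1) pair for every word in the document
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce: sum all the counts emitted for a single word
    return (word, sum(counts))

def map_reduce(documents):
    # "shuffle": group intermediate pairs by key (the framework's job)
    groups = defaultdict(list)
    for doc in documents:
        for word, one in map_fn(doc):
            groups[word].append(one)
    return dict(reduce_fn(w, c) for w, c in groups.items())

docs = ["the quick brown fox", "the lazy dog", "the dog"]
counts = map_reduce(docs)
print(counts["the"], counts["dog"])  # 3 2
```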

&lt;p&gt;&lt;strong&gt;Introducing, Spark.&lt;/strong&gt; Spark is a cluster computing framework that solved these problems. It avoids writing intermediate results to disk, keeping them in memory instead. It does this using something called a Resilient Distributed Dataset (RDD), an immutable collection of data that is partitioned into chunks and distributed across the nodes of a Spark cluster. All Spark operations work on RDDs. RDDs have two key features. One, they can be cached in the memory of the nodes, which makes them well suited to iterative tasks (like the data analysis use case we discussed above). Two, Spark keeps a record of the transformations that produced an RDD, its lineage, which allows it to reconstruct the RDD if it is lost, thereby making the otherwise unreliable in-memory operations fault-tolerant. As a result, Spark is faster than map-reduce and is today a popular choice for large-scale data analytics!&lt;/p&gt;
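&lt;p&gt;The lineage idea can be sketched in a few lines of plain Python. This toy class (nothing like Spark's real API, just an illustration) records each dataset's parent and transformation, so a "lost" partition can be recomputed instead of replicated:&lt;/p&gt;

```python
class ToyRDD:
    """Toy, single-machine illustration of lineage-based recovery."""
    def __init__(self, data=None, parent=None, transform=None):
        self._cache = data          # in-memory data; may be "lost"
        self.parent = parent        # lineage: which RDD this came from
        self.transform = transform  # lineage: how it was derived

    def map(self, fn):
        return ToyRDD(parent=self,
                      transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self,
                      transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:
            # recompute from lineage rather than rely on replication
            self._cache = self.transform(self.parent.collect())
        return self._cache

base = ToyRDD(data=[1, 2, 3, 4, 5])
squares = base.map(lambda x: x * x).filter(lambda x: x > 5)
print(squares.collect())   # [9, 16, 25]
squares._cache = None      # simulate losing the in-memory partition
print(squares.collect())   # recomputed from lineage: [9, 16, 25]
```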

&lt;p&gt;So, where are we today with big data infrastructure? It's interesting to note that we started with map-reduce trading off traditional database wisdom for simplicity and scale, but we have gradually come back around to incorporating many database techniques and addressing Michael Stonebraker's criticisms. For example, Spark now uses DataFrames rather than raw RDDs, and DataFrames have schemas. Spark's query engine has a planning and optimization layer, and file formats like Parquet have adopted the columnar compression techniques used in analytical databases. I guess that was the only way to move forward: iteratively, rather than trying to solve every problem at once. &lt;/p&gt;

&lt;p&gt;It's fascinating to fathom the path we've taken. We've come a long way from being able to process kilobytes on a single machine to petabytes on 1000s of machines. And, I'm excited to see how we advance further!&lt;/p&gt;

&lt;p&gt;Further Reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://thenoisychannel.com/2009/04/15/dewitt-and-stonebraker-vs-mapreduce-round-2/"&gt;MapReduce vs Michael Stonebraker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Most of my learnings are from my masters course - &lt;a href="https://event.cwi.nl/lsde/2021/index.shtml"&gt;LSDE&lt;/a&gt; at VU, Amsterdam&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>spark</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Golang for Object Oriented People </title>
      <dc:creator>Aneesh Makala</dc:creator>
      <pubDate>Sun, 28 Mar 2021 13:01:51 +0000</pubDate>
      <link>https://dev.to/makalaaneesh/golang-for-object-oriented-people-l7h</link>
      <guid>https://dev.to/makalaaneesh/golang-for-object-oriented-people-l7h</guid>
      <description>&lt;p&gt;To start off with, in golang, there are no classes, only structs.&lt;/p&gt;

&lt;p&gt;To add fuel to the fire, there is no type hierarchy (i.e. no inheritance). This is probably the biggest difference compared to other object-oriented languages. In fact, &lt;a href="https://golang.org/doc/faq#Is_Go_an_object-oriented_language"&gt;Go can't truly be considered an object-oriented language&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Ok, let's talk about type hierarchy.&lt;/p&gt;

&lt;p&gt;Not having a type hierarchy was an intentional design choice in golang. &lt;em&gt;Type hierarchies result in brittle code. The hierarchy must be designed early, often as the first step of designing the program, and early decisions can be difficult to change once the program is written. As a consequence, the model encourages early overdesign as the programmer tries to predict every possible use the software might require, adding layers of type and abstraction just in case. This is upside down. The way pieces of a system interact should adapt as it grows, not be fixed at the dawn of time.&lt;/em&gt; - &lt;a href="https://talks.golang.org/2012/splash.article#TOC_10"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's take a quick look at &lt;a href="https://golang.org/doc/effective_go#embedding"&gt;type embedding&lt;/a&gt;. At first, it seems to be a mechanism for implementing inheritance, but upon close inspection, it becomes clear that it is not. It's simply a tool to borrow pieces of an implementation.&lt;/p&gt;

&lt;p&gt;For example, take a look at the below code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Animal&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Dog&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;Animal&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Firstly, you can't just instantiate a Dog struct by&lt;/span&gt;
&lt;span class="c"&gt;// passing in the field "Name".&lt;/span&gt;
&lt;span class="c"&gt;//     d := Dog{Name:"dog"}&lt;/span&gt;

&lt;span class="c"&gt;// This is the proper way to instantiate a Dog struct&lt;/span&gt;
&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Dog&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Animal&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Animal&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"dog"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="c"&gt;// Next, you might want to employ polymorphism by&lt;/span&gt;
&lt;span class="c"&gt;// creating a variable of type Animal, &lt;/span&gt;
&lt;span class="c"&gt;// and assigning the dog struct to it.&lt;/span&gt;
&lt;span class="c"&gt;// Again, doesn't work. Just because you "embedded" &lt;/span&gt;
&lt;span class="c"&gt;// the Animal struct inside the Dog struct &lt;/span&gt;
&lt;span class="c"&gt;// doesn't create a type hierarchy.&lt;/span&gt;
&lt;span class="c"&gt;//     var a Animal;&lt;/span&gt;
&lt;span class="c"&gt;//     a = Dog{Animal : Animal{Name:"dog"}}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's clear that embedding doesn't create a type hierarchy. Note that you can still access the &lt;code&gt;Name&lt;/code&gt; field by doing &lt;code&gt;d.Name&lt;/code&gt;. That's because embedded fields and methods are "promoted" to the outer struct.&lt;/p&gt;

&lt;p&gt;Ok, so there is no type hierarchy, and embedding only looks like it creates one. How, then, do you group structs together to represent an abstraction? For example, how do you group a cat and a dog under an "Animal" abstraction, as you would in other OOP languages by creating an Animal superclass with Cat and Dog subclasses? &lt;/p&gt;

&lt;p&gt;Answer: &lt;em&gt;You group structs not by their type, but by their behavior, i.e. interfaces.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interfaces
&lt;/h2&gt;

&lt;p&gt;An interface is just a set of methods. Any type that implements these methods satisfies this interface implicitly. The idea is that you can create an abstraction of different concrete type values and work with the values through their common behavior, without having to define a type hierarchy. &lt;a href="https://www.ardanlabs.com/blog/2016/10/reducing-type-hierarchies.html"&gt;Here's&lt;/a&gt; a nice example that explains how.&lt;/p&gt;

&lt;p&gt;Further, Interfaces allow you to embody the &lt;a href="https://en.wikipedia.org/wiki/Dependency_inversion_principle"&gt;dependency inversion principle&lt;/a&gt;. It is a recommended practice in go to &lt;a href="https://medium.com/@cep21/what-accept-interfaces-return-structs-means-in-go-2fe879e25ee8"&gt;accept interfaces and return structs&lt;/a&gt;. This way your code depends on abstractions rather than concrete implementations.&lt;/p&gt;

&lt;p&gt;Also, unlike in other languages, rather than declaring an interface in the package that implements a concrete type satisfying it, it is recommended to &lt;a href="https://github.com/golang/go/wiki/CodeReviewComments#interfaces"&gt;declare interfaces in the package where the interface is used&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's an interesting and fundamental shift in how programmers &lt;br&gt;
should think - &lt;em&gt;You group structs not by their type, but by their behavior.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Sure, but every OOP language has interfaces. How do I reuse code in golang? &lt;/p&gt;

&lt;h2&gt;
  
  
  Composition over inheritance
&lt;/h2&gt;

&lt;p&gt;You reuse code in golang using embedding and composition. There's a famous principle in object-oriented programming: favour composition over inheritance. Here's a &lt;a href="https://medium.com/applike/how-to-using-composition-over-inheritance-6681ed1b78e4"&gt;nice article&lt;/a&gt; that describes the principle and how go embodies it.&lt;/p&gt;

</description>
      <category>go</category>
      <category>oop</category>
    </item>
    <item>
      <title>Why we need to represent a state transfer</title>
      <dc:creator>Aneesh Makala</dc:creator>
      <pubDate>Sun, 09 Aug 2020 12:41:53 +0000</pubDate>
      <link>https://dev.to/makalaaneesh/why-we-need-to-represent-a-state-transfer-oco</link>
      <guid>https://dev.to/makalaaneesh/why-we-need-to-represent-a-state-transfer-oco</guid>
<description>&lt;p&gt;REST (Representational State Transfer) has pretty much become the norm now. But, as with anything else, it's crucial to understand its motivation. Why did we come up with REST? What was the situation before REST? &lt;/p&gt;

&lt;p&gt;Well, picture this: it's the 90s. You've built software that interacts with an Oracle database. All's well. You have the desktop application installed on all the computers of your company. It comes bundled with the Oracle client libraries, and everything's running smoothly. &lt;/p&gt;

&lt;p&gt;Now, Oracle releases a new version of their database that includes some critical security bug fixes. You have no choice but to upgrade. You update your code to use the newer version of Oracle. Maybe that merely involves a version change, or maybe there are some backward-incompatible changes which force you to rewrite a few lines of code. You take the database down, upgrade it, and then push out your software update. Now, you need to make sure that all the computers use the latest version of your software, otherwise they won't work (since you have upgraded the database). You make sure you install patches on all the machines, and only then can you go back to sleep, hoping that you didn't break anything. It's ok, don't worry. I'm sure you've followed test-driven development practices strictly.&lt;/p&gt;

&lt;p&gt;Well, if you look at this situation closely, you'll realise that there is a tight coupling between the client application and the server. The client is very aware of the version of the database used by the server. This is a clear breach of &lt;a href="https://en.wikipedia.org/wiki/Separation_of_concerns"&gt;separation of concerns&lt;/a&gt;. So, what do we need? Hmm, it would be nice if the client could just express what it needs via some standard protocol and get the information back in some particular format, without having to worry about how the server retrieved that information. &lt;/p&gt;

&lt;p&gt;Well, that's what REST is. It's the transferring of "representations" of resources. As opposed to directly dealing with the resource, you deal with "representations" of it. You use a "representation of a resource" to transfer resource state, which lives on the server, into application state on the client. You make an HTTP GET request with an &lt;code&gt;id&lt;/code&gt; (identifying the resource that you want), and retrieve the complete state of the resource that the server is aware of.&lt;/p&gt;
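&lt;p&gt;The idea of transferring a representation, rather than the resource itself, can be sketched without any web framework. In this toy example (the resource data and function names are invented for illustration), the "server" holds the resource state and hands out only a JSON representation of it:&lt;/p&gt;

```python
import json

# Hypothetical server-side resource state (think: a row in that Oracle database).
server_resources = {42: {"id": 42, "name": "invoice-april", "status": "paid"}}

def handle_get(resource_id):
    # The server transfers a *representation* (here, JSON text), never the
    # underlying resource, a database handle, or any client-library object.
    return json.dumps(server_resources[resource_id])

# Client side: rebuild application state from the representation alone.
# The client knows nothing about how the server stored or retrieved the data.
representation = handle_get(42)
client_state = json.loads(representation)
print(client_state["status"])  # paid
```

Because the client only ever parses the representation, the server is free to swap its database or upgrade its storage layer without breaking any client.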

&lt;p&gt;Such a separation of concerns is what made it possible to achieve high scale. For example, it doesn't matter if you're building an application for a mobile, a desktop or a tablet, you can use the same representations to send and receive data. &lt;/p&gt;

&lt;p&gt;Sources&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=1o7bB4hUPew&amp;amp;list=PLQnljOFTspQU6zO0drAYHFtkkyfNJw1IO&amp;amp;index=18"&gt;Hussein Nasser's video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/a/10421579"&gt;Stack Overflow &amp;lt;3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rest</category>
      <category>webdev</category>
      <category>http</category>
    </item>
    <item>
      <title>Dynamic Programming for Pythonistas</title>
      <dc:creator>Aneesh Makala</dc:creator>
      <pubDate>Sun, 14 Jun 2020 07:34:08 +0000</pubDate>
      <link>https://dev.to/makalaaneesh/dynamic-programming-for-pythonistas-21pl</link>
      <guid>https://dev.to/makalaaneesh/dynamic-programming-for-pythonistas-21pl</guid>
<description>&lt;p&gt;Dynamic programming is an intimidating topic when it comes to interview preparation. I've always dreaded it. But this time, I found an intuitive way of looking at it, thanks to Python.&lt;/p&gt;

&lt;p&gt;First, let's understand the motivation for dynamic programming. Let's say you have a problem to solve. You realise that if you solve a subproblem (which is a smaller instance of the same problem), you can use the solution to that subproblem to solve the main problem. That's the crux of &lt;a href="https://www.khanacademy.org/computing/computer-science/algorithms/recursive-algorithms/a/properties-of-recursive-algorithms#:~:text=Here%20is%20the%20basic%20idea,to%20solve%20the%20original%20problem."&gt;recursive algorithms&lt;/a&gt;. For example, you want to calculate the factorial of &lt;code&gt;n&lt;/code&gt;. Well, if only you had the factorial of &lt;code&gt;n-1&lt;/code&gt;, you could just multiply &lt;code&gt;n&lt;/code&gt; with that value, and you have your solution. 💯 Alright now, let's focus on how to get the factorial of &lt;code&gt;n-1&lt;/code&gt;. Well, if only you had the factorial of &lt;code&gt;n-2&lt;/code&gt;, you could simply multiply &lt;code&gt;n-1&lt;/code&gt; with that value, and you have the solution. 💯💯 Now, how do we get the factorial of &lt;code&gt;n-2&lt;/code&gt;? If only you had... Ok, I'm going to stop. You get the point. Here's the recursive solution for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So far, so good. Now, what's dynamic programming? It is &lt;strong&gt;merely&lt;/strong&gt; an optimization over recursive solutions that becomes relevant when you have multiple calls to the recursive function with the same inputs. For example, let's take a look at the Fibonacci problem. The Fibonacci formula is &lt;code&gt;fib(n) = fib(n-1) + fib(n-2)&lt;/code&gt;. Now, &lt;code&gt;fib(5) = fib(4) + fib(3)&lt;/code&gt; and &lt;code&gt;fib(6) = fib(5) + fib(4)&lt;/code&gt;. Both calculations require the value of &lt;code&gt;fib(4)&lt;/code&gt;, and in the naive recursive implementation, it will be calculated twice. Hmm, that's not very efficient. Let's see if we can do better. Let's simply cache the solutions so that we don't recompute them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;solution_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# retrieve from cache if present
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;solution_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;solution_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# save to cache before returning
&lt;/span&gt;    &lt;span class="n"&gt;solution_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That's basically the essence of dynamic programming: store the results of the subproblems so that we don't have to recompute them later. Well, now we're more efficient, but we've sacrificed code readability. There's so much going on in the &lt;code&gt;fib&lt;/code&gt; function. Our earlier code was so much simpler. Let's take it up a notch by making the code cleaner with a &lt;a href="https://www.programiz.com/python-programming/decorator"&gt;decorator&lt;/a&gt; in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cache_solution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;solution_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cached_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# retrieve from cache if present
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;solution_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;solution_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;func_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# save to cache before returning
&lt;/span&gt;        &lt;span class="n"&gt;solution_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;func_result&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;func_result&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached_func&lt;/span&gt;


&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;cache_solution&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What we've done here is separate the concern of solving the main problem, i.e. calculating the Fibonacci number, from that of caching the solutions to the subproblems. This makes it easier to come up with dynamic programming solutions. First, we concentrate on coming up with the recursive solution. After that, if we need to optimise it, we simply write a decorator!&lt;br&gt;
As a result, the code is cleaner and more extensible. How?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;fib&lt;/code&gt; function follows the single responsibility principle: it only deals with calculating the Fibonacci number. If we need to alter the logic, it's much easier to do.&lt;/li&gt;
&lt;li&gt;We've created a reusable solution for caching the solutions of subproblems. Maybe you have another function &lt;code&gt;foo&lt;/code&gt; whose solutions you need to cache. Just decorate it and you're done!&lt;/li&gt;
&lt;li&gt;Now, let's say you want to use an external cache (maybe, redis). You may want to do this because you want a central cache for all your applications, or you want to persist cache to disk, or you want to employ a cache eviction policy. Now, all you have to do is make a change to the &lt;code&gt;cache_solution&lt;/code&gt; decorator, to save the solution to redis (as opposed to a dictionary in local memory) and you're done. You can now use redis to cache the solutions for all the functions that use this decorator!&lt;/li&gt;
&lt;/ul&gt;
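&lt;p&gt;As it happens, Python's standard library ships this exact pattern: &lt;code&gt;functools.lru_cache&lt;/code&gt; is a ready-made memoization decorator, so in practice you don't even have to write &lt;code&gt;cache_solution&lt;/code&gt; yourself:&lt;/p&gt;

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache every subproblem's solution, unbounded
def fib(n):
    if n in (0, 1):
        return n
    return fib(n - 1) + fib(n - 2)

# Without the cache, fib(50) would take exponential time; with it, it's instant.
print(fib(50))  # 12586269025
```

(On Python 3.9+, `functools.cache` is shorthand for `lru_cache(maxsize=None)`.)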

</description>
      <category>python</category>
      <category>dynamicprogramming</category>
      <category>decorator</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Assimilating Auth</title>
      <dc:creator>Aneesh Makala</dc:creator>
      <pubDate>Sun, 10 May 2020 14:23:13 +0000</pubDate>
      <link>https://dev.to/makalaaneesh/assimilating-auth-508c</link>
      <guid>https://dev.to/makalaaneesh/assimilating-auth-508c</guid>
      <description>&lt;h2&gt;
  
  
  Firstly, what the hell is Auth?
&lt;/h2&gt;

&lt;p&gt;Well, auth could mean "authentication" or "authorization". &lt;br&gt;
Authentication is the process of verifying that you are who you say you are. Typically, this means validating your credentials, such as your username and password.&lt;br&gt;
Authorization, on the other hand, is the process of determining whether you (the authenticated user) have permission for a resource that you are trying to access.&lt;br&gt;
To be explicit, henceforth, let's use authN and authZ to refer to authentication and authorization respectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  When we login to a website, how does the server know I'm authenticated for every subsequent request I make?
&lt;/h2&gt;

&lt;p&gt;Well, there exists a naive mechanism of authentication called Basic Authentication, which can be used by an HTTP user agent to provide a username and a password when making a request. If you use this mechanism, you'll have to include the username and password in every subsequent request.&lt;br&gt;
But there are other, more sophisticated mechanisms of authentication used by websites. For example, in session-based authentication, after the first login, the server creates a session for the user, and returns a &lt;code&gt;session_id&lt;/code&gt; as a cookie back to the user agent. As long as the user is logged in, the cookie is sent along with every subsequent request. The server can just compare the &lt;code&gt;session_id&lt;/code&gt; it receives with the &lt;code&gt;session_id&lt;/code&gt; that it had created to verify the user's identity.&lt;br&gt;
There also exists a token-based mechanism of authentication, where the user first provides the username and password in exchange for a &lt;code&gt;token&lt;/code&gt;. For every subsequent request, the client sends the &lt;code&gt;token&lt;/code&gt; for the server to perform authentication.&lt;br&gt;
More reading - &lt;a href="https://stackoverflow.com/questions/51341562/alternatives-to-basic-authentication-when-logout-is-required"&gt;https://stackoverflow.com/questions/51341562/alternatives-to-basic-authentication-when-logout-is-required&lt;/a&gt;&lt;/p&gt;
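&lt;p&gt;To see just how naive Basic Authentication is: the &lt;code&gt;Authorization&lt;/code&gt; header is nothing but a base64 encoding of &lt;code&gt;username:password&lt;/code&gt; (the credentials below are made up). Base64 is an encoding, not encryption, which is why Basic auth is only safe over HTTPS:&lt;/p&gt;

```python
import base64

def basic_auth_header(username, password):
    # Basic auth: base64("username:password"), sent on *every* request.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"

header = basic_auth_header("alice", "s3cret")
print(header)
# Anyone who sees the header can trivially decode the credentials:
print(base64.b64decode(header.split(" ", 1)[1]).decode())  # alice:s3cret
```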

&lt;h2&gt;
  
  
  Hang on. Session-based and token-based sound like the same thing. What's the difference?
&lt;/h2&gt;

&lt;p&gt;The biggest difference is that in token-based authentication, the user's state is not stored on the server. The payload (including the token) that is sent is self-sufficient for validation by the server (by means of digital signatures). This makes it easy to scale your application.&lt;br&gt;
More reading - &lt;a href="https://blog.angular-university.io/angular-jwt/"&gt;https://blog.angular-university.io/angular-jwt/&lt;/a&gt;&lt;/p&gt;
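&lt;p&gt;A toy sketch of that statelessness, using an HMAC signature (real systems use a JWT library; the secret key and payload here are invented for illustration). The server keeps no per-user session; the signature alone proves that the server issued the token:&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-signing-key"  # hypothetical key; never leaves the server

def issue_token(payload):
    # Encode the payload, then sign it with the server's secret.
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def verify_token(token):
    # No user state is looked up anywhere: recompute the signature and compare.
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or forged token
    return json.loads(base64.urlsafe_b64decode(body.encode()))

token = issue_token({"user": "alice"})
print(verify_token(token))         # {'user': 'alice'}
print(verify_token(token + "00"))  # tampered signature, so None
```

Any server instance holding the same secret can validate the token, which is what makes this approach easy to scale horizontally.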

&lt;h2&gt;
  
  
  Ok, cool. Jumping to API development, I usually have to use an API key and a secret. What's that all about?
&lt;/h2&gt;

&lt;p&gt;These are used for authN.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, why do I need an API key and secret in the first place? Why can't I just use my username and password which I'm already using anyway?
&lt;/h2&gt;

&lt;p&gt;Good question. One reason is to make an attacker's task more complicated. If the attacker resorts to a brute-force attack, they have to search a much larger space (a key is much more random than a username). Also, using a key instead of a username prevents a leaked key from being traced back to the user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ok, but if we're choosing a separate key, why do I need a key and a secret, both? Why wouldn't a single unique code serve as username and password both?
&lt;/h2&gt;

&lt;p&gt;You must first understand the difference between symmetric and asymmetric cryptography. In symmetric cryptography, a single key is shared between both parties. The problem with this approach is key exchange: how do you &lt;strong&gt;safely&lt;/strong&gt; exchange the keys with which you plan to subsequently exchange information safely? What if the key exchange process is itself compromised?!&lt;/p&gt;

&lt;p&gt;Enter, asymmetric cryptography, where each party has a public and a private key pair. A message encrypted with a public key can only be decrypted by its private key (and vice versa). So, if I want to send you a message that only YOU can read, I simply encrypt with your public key. You receive the message, and decrypt it with your private key! That's where the secret comes in handy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait. If I'm using HTTPS, my requests should already be encrypted right? Why do I need a secret, then?
&lt;/h2&gt;

&lt;p&gt;You're right. It's an additional layer of security on top of HTTPS. It may sound trivial, but the more layers of security, the better.&lt;br&gt;
More reading - &lt;a href="https://stackoverflow.com/a/11561986/4434664"&gt;https://stackoverflow.com/a/11561986/4434664&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is OAuth? That comes up a lot.
&lt;/h2&gt;

&lt;p&gt;Ahaa. We've been talking about authN till now. OAuth is all about authZ.&lt;br&gt;
Let me explain the motivation behind OAuth. &lt;br&gt;
Let's say you have a photo printing web application. Your users upload a picture, and you print it out and deliver it to them. Now, your users want the option of directly selecting an image from their Google Drive as opposed to downloading it and uploading it. (Users are lazy, aren't they?) &lt;br&gt;
Now, one option is to blatantly ask the user for their Google Drive username and password. Obviously, that's not acceptable. &lt;/p&gt;

&lt;p&gt;Fine, another option is to ask users to share the photo with you using a shareable link. That might work for this use-case, but it might not work for others. For example, what if you want the user to share their entire list of contacts, so that you can broadcast some message of sorts? As a simple user, I don't have the capability of retrieving all that information from Google Drive; that's only accessible via an API. Ok, so they could create an application on Google Drive, and then share the API key and secret with you. But that's still too tedious for the user.&lt;/p&gt;

&lt;p&gt;Enter, OAuth. You must have encountered some screen like this at some point.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZB4q4jkR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://developers.google.com/identity/protocols/images/oauth2/device/approval.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZB4q4jkR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://developers.google.com/identity/protocols/images/oauth2/device/approval.png" alt="OAuth screen" title="OAuth"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's OAuth in action. It's a standard created for services to access each other on behalf of the user. The user is authenticated with your photo printing application. The user is also authenticated with Google Drive. Now, you simply ask Google Drive for permission on behalf of the user. Google Drive asks the user, via the screen shown above, whether they want to give you (the photo printing app) permission, which they accept (they'd better accept...). Google Drive returns an authZ token to you, using which you make subsequent requests to Google Drive to retrieve photos, contacts, etc. on behalf of the user.&lt;br&gt;
More reading - &lt;a href="https://developer.okta.com/blog/2017/06/21/what-the-heck-is-oauth"&gt;https://developer.okta.com/blog/2017/06/21/what-the-heck-is-oauth&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  That makes sense! Final question - If I'm building an API, should I use key and secret, or should I use OAuth?
&lt;/h2&gt;

&lt;p&gt;If the users of your API are all developers, you're better off using API keys and secrets.&lt;br&gt;
You only need OAuth when you want to let your users use some third-party client application to access their data stored in your application, without having to give their username/password to that third-party application.&lt;br&gt;
One other advantage of OAuth tokens over API keys and secrets is that they also provide authZ. Tokens can be tied to scopes: you can create a token that can only read contacts, and another that can not only access contacts, but also modify and delete them! Also, OAuth tokens have an expiry, making them more secure!&lt;br&gt;
More reading - &lt;a href="https://zapier.com/engineering/apikey-oauth-jwt/"&gt;https://zapier.com/engineering/apikey-oauth-jwt/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>authentication</category>
      <category>authorization</category>
      <category>oauth</category>
      <category>api</category>
    </item>
    <item>
      <title>Updating the mapping of an elasticsearch index</title>
      <dc:creator>Aneesh Makala</dc:creator>
      <pubDate>Tue, 21 Apr 2020 14:11:00 +0000</pubDate>
      <link>https://dev.to/makalaaneesh/updating-the-mapping-of-an-elasticsearch-index-3h9n</link>
      <guid>https://dev.to/makalaaneesh/updating-the-mapping-of-an-elasticsearch-index-3h9n</guid>
<description>&lt;p&gt;Okay, so you've set up Elasticsearch. You've indexed your data. Search is super fast. All's good. But suddenly, you have a requirement for which you need to change the mapping of your index. Maybe you need to use a different analyser, or maybe it's as simple as adding a new field to your document, which requires you to add the associated static mapping.&lt;/p&gt;

&lt;p&gt;If you find yourself in such a situation, here are a few approaches you can take: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Approach 1&lt;/strong&gt; - with downtime; index from external data source.

&lt;ul&gt;
&lt;li&gt;This assumes that you have an external data source such as a database from which you can index data all over again, as if you were doing it for the first time.&lt;/li&gt;
&lt;li&gt;When to use?

&lt;ul&gt;
&lt;li&gt;This approach only makes sense for testing purposes in local or in staging. This should not be used in a production environment because downtime isn't really desirable.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Steps

&lt;ul&gt;
&lt;li&gt;Delete the index using the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html"&gt;Delete API&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Create the index, and set the new mapping using the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html"&gt;PUT Mapping API&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Index documents from external data source. You could do this using the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html"&gt;Bulk API&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approach 2&lt;/strong&gt; - without downtime; index from external data source

&lt;ul&gt;
&lt;li&gt;When to use?

&lt;ul&gt;
&lt;li&gt;You could use this approach in production, but if you have a large number of documents, indexing from an external data source like a DB can be a time-consuming process.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Steps

&lt;ul&gt;
&lt;li&gt;If not done already, create an &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html"&gt;alias&lt;/a&gt; &lt;code&gt;index_alias&lt;/code&gt; for your existing index  (&lt;code&gt;old_index&lt;/code&gt;) and change your code to use the alias instead of &lt;code&gt;old_index&lt;/code&gt; directly.&lt;/li&gt;
&lt;li&gt;Create a new index &lt;code&gt;new_index&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Index documents from external data source. You could do this using the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html"&gt;Bulk API&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Move the alias &lt;code&gt;index_alias&lt;/code&gt; from &lt;code&gt;old_index&lt;/code&gt; to &lt;code&gt;new_index&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Caveats

&lt;ul&gt;
&lt;li&gt;While the downtime is essentially zero, there could still be &lt;a href="https://blog.codecentric.de/en/2014/09/elasticsearch-zero-downtime-reindexing-problems-solutions/"&gt;consistency issues&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Indexing from an external data source like a DB can be a time-consuming process if you have a large number of documents.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approach 3&lt;/strong&gt; - without downtime; index from elasticsearch

&lt;ul&gt;
&lt;li&gt;When to use?

&lt;ul&gt;
&lt;li&gt;Can be used in production when you want to &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html#updating-field-mappings"&gt;change the mapping of an existing field&lt;/a&gt;. If you are merely adding a field mapping, prefer Approach 4&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Steps

&lt;ul&gt;
&lt;li&gt;If not done already, create an &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html"&gt;alias&lt;/a&gt; &lt;code&gt;index_alias&lt;/code&gt; for your existing index  (&lt;code&gt;old_index&lt;/code&gt;) and change your code to use the alias instead of &lt;code&gt;old_index&lt;/code&gt; directly.&lt;/li&gt;
&lt;li&gt;Create a new index &lt;code&gt;new_index&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use the Elasticsearch &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.4/docs-reindex.html"&gt;Reindex&lt;/a&gt; API to copy docs from &lt;code&gt;old_index&lt;/code&gt; to &lt;code&gt;new_index&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Move the alias &lt;code&gt;index_alias&lt;/code&gt; from &lt;code&gt;old_index&lt;/code&gt; to &lt;code&gt;new_index&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Caveats

&lt;ul&gt;
&lt;li&gt;While the downtime is essentially zero, there could still be &lt;a href="https://blog.codecentric.de/en/2014/09/elasticsearch-zero-downtime-reindexing-problems-solutions/"&gt;consistency issues&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approach 4&lt;/strong&gt; - without downtime; update existing index

&lt;ul&gt;
&lt;li&gt;When to use?

&lt;ul&gt;
&lt;li&gt;Can be used in production when you want to merely add a new field mapping.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Steps

&lt;ul&gt;
&lt;li&gt;Update mappings of index online using &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html"&gt;PUT mapping API&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html"&gt;_update_by_query&lt;/a&gt; API with params

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;conflicts=proceed&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;In the context of just picking up an online mapping change, documents which have been updated during the process, and therefore have a version conflict, would have picked up the new mapping anyway. Hence, version conflicts can be ignored.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wait_for_completion=false&lt;/code&gt; so that it runs as a background task&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;refresh&lt;/code&gt; so that all shards of the index are updated when the request completes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Caveats

&lt;ul&gt;
&lt;li&gt;Can't be used if you want to &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html#updating-field-mappings"&gt;change the mapping of an existing field&lt;/a&gt;. Use Approach 3 instead.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
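&lt;p&gt;A note on the alias move in Approaches 2 and 3: do it as a single &lt;code&gt;_aliases&lt;/code&gt; request containing both a &lt;code&gt;remove&lt;/code&gt; and an &lt;code&gt;add&lt;/code&gt; action, so the swap is atomic and the alias never points to neither index. A minimal Python sketch that builds the request body (the index and alias names are the placeholders used above; actually sending it to your Elasticsearch host, e.g. with curl or the official client, is left out):&lt;/p&gt;

```python
import json

def alias_swap_body(alias, old_index, new_index):
    # Both actions in one request make the swap atomic: Elasticsearch applies
    # them together, so readers never observe an alias with no backing index.
    actions = {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
    return json.dumps(actions)

# POST this body to the _aliases endpoint on your Elasticsearch host.
print(alias_swap_body("index_alias", "old_index", "new_index"))
```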

</description>
      <category>elasticsearch</category>
    </item>
  </channel>
</rss>
