Daniil Roman

How we built Elasticsearch index

What we wanted to do

Hey everyone. I'm Daniil and I work at Just AI, where we are building a chatbot development platform. Our platform has its own DSL that lets users create a chatbot scenario with less effort. The DSL describes bot behavior and lets you enrich your bot with complex logic using JavaScript. Chatbot developers who work on our platform use our web IDE, which supports this DSL.

Bot scenarios can have a lot of files, and of course you need to be able to search those files.
Let's take a look at exactly what kind of search we wanted. In short, we wanted an IDE-like search: one where you can find a result not only by part of a word, but also by regex or whole-word match, with case sensitivity as an option too.

Here’s what we got:

(Screenshot: how the search looks in our web IDE.)

What you will and won’t find in this article

Here you'll discover how we built the index we ended up with. Some knowledge of Elasticsearch will help you understand this article better.

While I'll also explain a couple of Elasticsearch concepts, this article is definitely not an Elasticsearch tutorial.

Why Elasticsearch?

You'll probably ask, "Why Elasticsearch exactly?" The answer is really simple: our team just didn't have much experience with search engines, so choosing the most popular one was straightforward and obvious. Furthermore, our operations team already had experience maintaining Elasticsearch for Graylog. Almost everyone knows about Elasticsearch: it is the most popular engine for searching logs, and there are a lot of tutorials and articles about it on the Internet.

How do we store our files in the source datastore?

As I mentioned above, our original data is programming code, for example the js and sc files you could see above. So let's take a look at our source files in detail. We store our data in MongoDB, and every single file is a separate document in a MongoDB collection.

Let’s look at an example:

theme: /

    state: Start
        q!: $regex</start>
        a: Let's start.

    state: Hello
        intent!: /hello
        a: Hello hello

    state: Bye
        intent!: /bye
        a: Bye bye

    state: NoMatch
        event!: noMatch
        a: I do not understand. You said: {{$request.query}}

MongoDB structure looks like that:

{
    "fileName": String,
    "content": BinData
}

And data in the specific document looks like that:

{
  "fileName": "main.sc",
    "content":"ewogICAgInByb2plY3QiO...0KICAgIF0KfQ=="
}

Now we understand our data and our goal. So the main question of this article is: should we carry our original collection structure over to Elasticsearch as is, or change it?

How to migrate data to Elasticsearch?

We have different tools to help us migrate our data to Elasticsearch. For example, Logstash is a good and time-proven tool to synchronize our data from different sources with Elasticsearch. You can also transform and filter data in this pipeline.

Pros:

  • Well-known and time-proven product.
  • You only have to write a config file instead of implementing all the sync logic yourself.
  • It is an external process that you can scale separately.

Cons:

  • It's not obvious how complex the logic you can express in a config file can get.
  • Keeping the sync and transformation logic outside our main codebase could make the overall logic harder to understand and maintain.

Logstash is a really great tool, but in our case it looked too complicated and had too many uncertainties. It didn't seem like a good fit.

So we decided to migrate our data and transform our source collection structure into the index structure ourselves.

The first try

Since we need to find the line number and the match position within a line, we have only two options. The first one: every document is a single line, hence every document is small. The second one: every document is a whole file, as in MongoDB, but the document contains a list of lines together with their line numbers.

So let's start with a quick explanation of how Elasticsearch works. The heart of Elasticsearch is an inverted index: in other words, when you get a match, you get the whole matching document back. That's why we have to store some additional information, for example to know which line of the document actually matched.

Since we store the whole file in MongoDB, we decided to do the same in Elasticsearch as a first step. The main argument was that it should consume less memory than the per-line option.

In this case, our Elasticsearch mappings look like this:

{
    "files_index": {
        "mappings": {
            "properties": {
                "fileName": {
                    "type": "keyword"
                },
                "lines": {
                    "type": "nested",
                    "properties": {
                        "line": {
                            "type": "text",
                            "analyzer": "ngram_analyzer"
                        },
                        "lineNumber": {
                            "type": "integer"
                        }
                    }
                }
            }
        }
    }
}

The lines field is important for us. This field has the nested type, which comes with warnings all over Google search results: "don't use this type" and "don't make big nested fields". So... let's break both rules at once )

In Elasticsearch, the file from the top of the article looks like this:

{
  "fileName": "main.sc",
    "lines": [
                    {"line": "require: slotfilling", "lineNumber": 1},
                    ...
                    {"line": "        a: I do not understand. You said: {{$request.query}}", "lineNumber": 19}
  ]
}
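For completeness: searching this structure requires a nested query, and to find out which lines matched you also need inner_hits. A minimal sketch, assuming the index name from the mappings above (the search term "hello" is just an example):

GET /files_index/_search
{
  "query": {
    "nested": {
      "path": "lines",
      "query": {
        "match": { "lines.line": "hello" }
      },
      "inner_hits": {}
    }
  }
}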

And it worked!

But... there is always a "but".

The index is updated quite often. And because a single document in the index could be huge, a lot of our HTTP calls to Elasticsearch failed with a 30-second timeout.

So it looks like we ran into exactly the problem those warnings were about ) and we had to fix it.

Let’s make our index smaller

The option where a document is a whole file didn't work. So we had to go with the other option and make each document a single line of a file.

In this case, we have another index structure:

{
    "files_index": {
        "mappings": {
            "properties": {
                "fileName": {
                    "type": "keyword"
                },
                "line": {
                    "type": "text",
                    "analyzer": "ngram_analyzer"
                },
                "lineNumber": {
                    "type": "integer"
                }
            }
        }
    }
}

And our document looks like this:

{
  "fileName": "main.sc",
  "line": "require: slotfilling",
  "lineNumber": 1
}
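Indexing a file in this scheme means turning every line into its own small document. One way to do that is a single _bulk request per file; here's a minimal sketch reusing two lines from the example file above:

POST /files_index/_bulk
{ "index": {} }
{ "fileName": "main.sc", "line": "require: slotfilling", "lineNumber": 1 }
{ "index": {} }
{ "fileName": "main.sc", "line": "        a: I do not understand. You said: {{$request.query}}", "lineNumber": 19 }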

Elasticsearch started to work faster and more stably, and it could handle many more calls without timing out. Our initial fear about memory consumption didn't materialize either: the index size barely changed.

So far we have described only our index structure, but not the search query itself.

Default search

Let’s look at the example:

You have a document in Elasticsearch - “hello world”.

You have default settings and you want to find this document by searching for "hel". Looks like a realistic case, doesn't it?

So, in this case you’ll get nothing.

That's because of the analyzer and tokenizer: both the query string and the indexed data are preprocessed by them, and if the analyzer is wrong for your use case you simply won't get a match. The default behavior is to split text by spaces and special characters. That's why you'll get nothing for "wor", but you'll get what you want for "world".
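You can check this with the _analyze API: with the default standard analyzer, "hello world" is split into just two tokens, hello and world, so only those whole tokens can match.

GET /_analyze
{
  "analyzer": "standard",
  "text": "hello world"
}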

How can we find what we want using part of a word?

It's time to become familiar with ngrams in Elasticsearch. This article by GitLab gave us confidence that we had found what we were looking for.

Ngram here means an ngram-based analyzer in Elasticsearch. You can set it in the mappings for a field.

Example:

Say we save the string "hello world" to the index, and say we have min=3 and max=5 in our ngram analyzer settings.
It means the text is split into substrings of 3, 4 and 5 characters:
hel, ell, llo, lo , o w, ..., rld, ..., o wor, worl, world
And if the query matches any of those substrings, we'll get "hello world" back.
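We haven't shown the definition of ngram_analyzer itself; index settings producing this min=3/max=5 behavior would look roughly like this sketch (the tokenizer name is arbitrary, and a gram range wider than 1 also requires raising index.max_ngram_diff):

PUT /files_index
{
  "settings": {
    "index": { "max_ngram_diff": 2 },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer"
        }
      }
    }
  }
}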

So that's what we did. Above you can see how we applied the ngram analyzer to the line field.

"line": {
  "type": "text",
  "analyzer": "ngram_analyzer"
}
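A quick way to check that the analyzer is wired up correctly is to run _analyze against the field and make sure the 3-5 character grams come back:

GET /files_index/_analyze
{
  "field": "line",
  "text": "hello world"
}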

But... Here comes another “but”.

It is a really good approach: everything works well and fast. The only issue is that the index takes up a really large amount of memory.

We started looking for a solution to this issue and found it.

Wildcard

A wildcard is a pattern like hel*.
The wildcard field type also searches by substring, but it is much faster than a plain regex over a text field. It works through a special combination of ngrams and regex matching and strikes a compromise between query speed and index memory consumption. You define it as the field type in the mappings, and then use wildcard patterns in your query strings.

And our final index structure looks like this:

{
    "files_index": {
        "mappings": {
            "properties": {
                "fileName": {
                    "type": "keyword"
                },
                "line": {
                    "type": "wildcard"
                },
                "lineNumber": {
                    "type": "integer"
                }
            }
        }
    }
}

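Querying a wildcard field uses the ordinary wildcard (or regexp) query. A minimal sketch of a substring search:

GET /files_index/_search
{
  "query": {
    "wildcard": {
      "line": { "value": "*hel*" }
    }
  }
}

Recent Elasticsearch versions also accept a case_insensitive flag on wildcard and regexp queries, which is handy for the case-sensitivity toggle we wanted.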

Compared to the ngram analyzer, this reduced our index memory consumption by some 4-5 times!

We didn't measure this on our production data or in the production environment; we tested it on synthetically generated data of about 1 gigabyte.
First of all we executed POST /<index>/_forcemerge for the index.
And after that we could see the size using GET /_cat/indices/<index>.
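In concrete terms, the check looks something like this (files_index is just the example name; the _cat column list is a convenience):

POST /files_index/_forcemerge
GET /_cat/indices/files_index?v&h=index,docs.count,store.size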

Our index isn't big at all (only about 10 GB), so this works really well for us: almost every call to Elasticsearch takes less than 0.1 seconds. If you have a really huge Elasticsearch index, this solution could be somewhat slow for you. But for us, and for most cases like ours, it is a perfect match.

Summary

In this article, we went from being Elasticsearch newbies to understanding the pros and cons of different solutions and using them in a real case.

I hope it was helpful and saves someone time and effort. I also hope it shows that Elasticsearch isn't a black box that just searches for something, but more like a Lego set you can assemble to fit your needs.

We still have much more to tell, because here we only talked about the structure of the index and didn't tackle all the other tasks needed to get an IDE-like search with Elasticsearch.

(GIF: an example of how the search works.)
