Elasticsearch as a primary database?

#elasticsearch #database #bigdata #datascience

A little thing about our dear Mr. ES.

Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. It is built on Apache Lucene, a full-text search library whose core data structure is an inverted index. This means that it takes all the documents, splits them into words, and records for each word which documents contain it. Everything here also applies to Apache Solr, since both are built on the Lucene library.
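To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the concept, not how Lucene is implemented: the function names `build_inverted_index` and `search` and the sample documents are made up for this example, and real analyzers do far more than lowercasing and splitting on word characters.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Start from the first word's documents, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Elasticsearch is built on Apache Lucene",
    2: "Apache Solr also uses Lucene",
    3: "A relational database stores rows",
}
index = build_inverted_index(docs)
print(sorted(search(index, "apache lucene")))  # [1, 2]
```

A lookup is a few set intersections instead of a scan over every document, which is where the search speed comes from.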

Great, now that the definitions are out of the way, here are a few things we would like to point out so that we use it properly. To start off:

Elasticsearch is not a transactional database. Yes, it saves data, but what you need from a primary database is full ACID compliance for writes that span multiple documents. For example, if you bulk index (bulk insert) a batch of documents into Elasticsearch and only some of them succeed, because of low disk space, a red cluster or some other factor, Elasticsearch will not do “all or nothing”. A transactional database would roll back the whole batch; Elasticsearch keeps the documents that succeeded and reports the failures individually. So I do not suggest you use it for things that are transactional in nature, for example financial records.
Under high load, indexing becomes a tough job for Elasticsearch, which makes real-time search hard. This is because indexing and refreshing (making newly indexed data available for reads) are expensive when many documents are being inserted (100,000+ per second, sometimes less depending on hardware and other factors). It's a trade-off, but some who use Lucene directly have figured out ways around it, like Twitter: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/reducing-search-indexing-latency-to-one-second.html
In terms of the CAP theorem, Elasticsearch does not meet the requirement of strong consistency, only a weaker form of it. It is a good AP system: it is distributed and makes sure it is highly available. If a storage system is not CP, then DO NOT use it for financial records or things that are transactional in nature. A system does not need to be relational to be CP; some NoSQL databases are CP systems. https://www.ibm.com/cloud/learn/cap-theorem
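The bulk indexing point above can be seen in the shape of a bulk response: it carries a top-level `errors` flag and one entry per action in `items`, so some documents can succeed while others fail in the same request. The sketch below only parses a dictionary of that shape. The helper name `split_bulk_result` is hypothetical and the sample response is hand-written for illustration, not output from a real cluster.

```python
def split_bulk_result(bulk_response):
    """Split a bulk-response-shaped dict into succeeded ids and failures."""
    succeeded, failed = [], []
    for item in bulk_response.get("items", []):
        # Each item has a single key naming the action (index, create, ...).
        _action, result = next(iter(item.items()))
        if "error" in result:
            failed.append((result.get("_id"), result["error"].get("type")))
        else:
            succeeded.append(result.get("_id"))
    return succeeded, failed


# Hand-written sample shaped like a bulk response (an assumption for the demo).
sample = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201, "result": "created"}},
        {"index": {"_id": "2", "status": 429,
                   "error": {"type": "es_rejected_execution_exception",
                             "reason": "..."}}},
        {"index": {"_id": "3", "status": 201, "result": "created"}},
    ],
}

succeeded, failed = split_bulk_result(sample)
print(succeeded)  # ['1', '3']
print(failed)     # [('2', 'es_rejected_execution_exception')]
```

Documents 1 and 3 stay indexed even though document 2 was rejected. Nothing is rolled back, so the caller has to retry or compensate for the failed items.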

Recommendations

Elasticsearch should be used as a derived storage system to take advantage of its speed. Here, data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source.
Elasticsearch should be used in places where you need real-time or near-real-time searches or record displays. It will easily do better than your standard database when it comes to these reads.
Elasticsearch is great for analytics. I do not recommend running expensive aggregate queries on your live database when they can be done on ES instead, especially when they would need joins.
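The derived storage recommendation can be sketched as follows. This is an in-memory illustration under stated assumptions: `primary_rows` stands in for the system of record, a plain dict stands in for the search index, and the names `to_search_doc` and `rebuild_index` and the row fields are hypothetical.

```python
def to_search_doc(row):
    """Transform a primary-store row into a denormalized search document."""
    return {
        "id": row["id"],
        "full_name": f'{row["first_name"]} {row["last_name"]}',
        "city": row["city"].lower(),
    }


def rebuild_index(rows):
    """Recreate the whole derived index from the source of truth."""
    return {row["id"]: to_search_doc(row) for row in rows}


# Hypothetical rows from the primary (transactional) database.
primary_rows = [
    {"id": 1, "first_name": "Ama", "last_name": "Mensah", "city": "Accra"},
    {"id": 2, "first_name": "Kofi", "last_name": "Boateng", "city": "Kumasi"},
]

search_index = rebuild_index(primary_rows)

# Simulate losing the derived data, then recreating it from the original.
search_index.clear()
search_index = rebuild_index(primary_rows)
print(search_index[2]["full_name"])  # Kofi Boateng
```

Because the index is a pure function of the primary data, losing it costs a reindex and no records.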
