Elasticsearch Fundamentals

#elk #elasticsearch

Elasticsearch is a search engine built on Apache Lucene. It is a near-real-time, distributed, multitenant-capable full-text search engine that exposes a RESTful API over JSON documents. It can be used for full-text search, structured search, analytics, or all three. One of its most important advantages is speed: text is indexed in advance, so a search runs against the index rather than the raw text. Many search engines have long been able to search by timestamp or exact values; Elasticsearch distinguishes itself by running full-text searches, handling synonyms, and ranking results by relevance.
It can also run near-real-time analytics and aggregations over the same data, which is an area where it is particularly strong.
Elasticsearch is widely used in many large corporations. Here are some examples of applications:

Elasticsearch is used by Netflix to deliver millions of messages to customers every day via channels such as email, push alerts, text messaging, and phone calls.
Salesforce has created a bespoke plugin on top of Elasticsearch that collects Salesforce log data, allowing for insights on organizational usage trends and user activity.
Elasticsearch is used by the New York Times to store all 15 million articles written over the previous 160 years. This allows for fantastic archival search capabilities.
Elasticsearch is used by Microsoft for search and analytics capabilities in a variety of products, including MSN, Microsoft Social Listening, and Azure Search.
Elasticsearch has been used by eBay to provide a versatile search platform.

Elasticsearch is not used only by major enterprises; many startups and small businesses use it as well. Part of its appeal is that it can run on a laptop or scale out to hundreds of servers and petabytes of data.

Key features

It offers near-real-time search and analytics for your data.
Elasticsearch is a distributed system that can run on anything from a basic laptop to hundreds of nodes.
It may be used to deploy multitenant, highly available clusters. It automatically rearranges and rebalances data upon the addition of a new node or the failure of a node.
Elasticsearch distributes the processing of queries and data storage among many data nodes. Scalability, dependability, and performance are all improved.
Data in an Elasticsearch cluster is replicated across several nodes, so the data remains accessible even if one node fails.

Elasticsearch can comprehend and search natural language text since it is built on top of Lucene, a full-text search technology.
Rather than storing documents as rows in a table, Elasticsearch stores them as JSON.

Elasticsearch uses a JSON-based query language (the Query DSL) rather than a SQL-based one.
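
To make the contrast with SQL concrete, here is a minimal sketch of a query body built as a Python dictionary and serialized to JSON. The field names `title` and `published` and the search values are hypothetical; `bool`, `must`, `match`, `filter`, and `range` are standard Query DSL clauses.

```python
import json


def build_search_body(text, since):
    """Build a Query DSL body.

    Full-text match on a hypothetical `title` field, restricted to documents
    whose hypothetical `published` date is on or after `since`.
    """
    return {
        "query": {
            "bool": {
                # `must` clauses contribute to the relevance score.
                "must": [{"match": {"title": text}}],
                # `filter` clauses only include or exclude documents.
                "filter": [{"range": {"published": {"gte": since}}}],
            }
        },
        "size": 10,  # return at most ten hits
    }


body = build_search_body("distributed search", "2020-01-01")
print(json.dumps(body, indent=2))
```

The printed JSON is what a client would send as the body of a search request. A rough SQL counterpart would be a `SELECT` with a `WHERE` clause, but the `match` clause also scores each hit by relevance, which plain SQL filtering does not.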

Unlike relational databases, Elasticsearch does not support JOINs between tables.
Terms aggregations, geographic searches, and support for scripting languages are just a few of Elasticsearch's built-in analytical features.
A mapping in Elasticsearch is the equivalent of a schema in a relational database. If the data type of a document field has not been explicitly specified, Elasticsearch assigns one automatically the first time it encounters the field.
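
The automatic type assignment can be approximated in a few lines of Python. This is a simplified sketch of the idea, not Elasticsearch's actual implementation: the real rules also detect dates in strings and add a `keyword` sub-field to text fields. The function names and the sample document are made up for illustration.

```python
def infer_field_type(value):
    """Approximate the field type Elasticsearch would assign to a JSON value."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        for item in value:
            if item is not None:
                return infer_field_type(item)
    return None  # null or empty array: no field is added yet


def infer_mapping(document):
    """Return {field: type} for every field whose type can be inferred."""
    mapping = {}
    for field, value in document.items():
        field_type = infer_field_type(value)
        if field_type is not None:
            mapping[field] = field_type
    return mapping


doc = {"name": "Laptop", "price": 999.5, "stock": 12, "active": True, "notes": None}
mapping = infer_mapping(doc)
print(mapping)
# {'name': 'text', 'price': 'float', 'stock': 'long', 'active': 'boolean'}
```

Because the type is fixed the first time a field is seen, it is usually safer to define the mapping explicitly for fields whose type matters.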

Key Components

Cluster
A cluster is a grouping of one or more nodes that collectively contains all of the data and offers federated indexing and search capabilities across all nodes. Each node in a cluster should be given a distinct name.
Node
A node generally refers to a server that functions as a cluster member. In the Elasticsearch context, however, a node is a running instance of Elasticsearch, not a machine, so several nodes can operate on a single machine. A cluster consists of one or more such nodes, and starting an Elasticsearch instance starts a node.
Each node is identified by a distinct name. If no name is supplied, Elasticsearch assigns one at startup (a random UUID-based identifier in older versions; recent versions default to the machine's hostname). The 'cluster.name' setting is part of every node's configuration. Nodes that start with the same 'cluster.name' and can reach one another join the same cluster automatically.
A node must carry out several tasks:
- Storing data
- Processing data (indexing, searching, aggregation, etc.)
- Preserving the cluster's health

In a cluster, all these operations are available to every node. Elasticsearch offers the option to distribute duties among several nodes. This makes scaling, optimizing, and maintaining the cluster simple.
The three primary ways to set up an Elasticsearch node are as follows:
- Master node: controls the Elasticsearch cluster by processing one cluster state update at a time and broadcasting the new state to all other nodes. The master node is in charge of all cluster-wide operations, including the creation and deletion of indexes.
- Data node: holds the data and the inverted index. By default, a node acts as a data node.
- Client node (called a coordinating node in recent versions): serves as a load balancer that routes incoming requests to the appropriate cluster nodes.
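
The division of labour between a client (coordinating) node and the data nodes can be sketched as a scatter-gather loop: the query goes to every shard, each shard returns its local hits, and the coordinator merges them. This is a toy model under stated assumptions: the score is a plain term count rather than Elasticsearch's real relevance scoring, and the shard contents and function names are invented for illustration.

```python
import heapq


def search_shard(shard_docs, term):
    """Return (score, doc_id) pairs for one shard; score = occurrences of `term`."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def coordinate_search(shards, term, size=3):
    """Scatter the query to every shard, then gather and keep the best `size` hits."""
    all_hits = []
    for shard_docs in shards:
        all_hits.extend(search_shard(shard_docs, term))
    return heapq.nlargest(size, all_hits)


# Two shards, as they might be held by two different data nodes.
shards = [
    {"d1": "fast search engine", "d2": "search search search"},
    {"d3": "distributed search and search analytics", "d4": "no match here"},
]
top = coordinate_search(shards, "search", size=2)
print(top)  # [(3, 'd2'), (2, 'd3')]
```

In a real cluster the per-shard searches run in parallel on the data nodes, which is why separating coordinating work from data work makes the cluster easier to scale.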

Port 9200 and Port 9300
Two primary ports are used by the Elasticsearch architecture for communication:
Port 9200 handles requests that originate outside the cluster. It serves the REST API over HTTP, which clients use for querying, indexing, and other operations.
Port 9300 is used for communication between nodes. This traffic goes over the transport layer.
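
As an illustration of the REST side, the sketch below builds (but does not send) a search request aimed at port 9200 using only the Python standard library. It assumes a node reachable over plain HTTP on `localhost` with security disabled and a hypothetical index named `articles`; `_search` is the standard search endpoint.

```python
import json
import urllib.request


def build_search_request(host, index, body, port=9200):
    """Build an HTTP request for the search endpoint of `index` on the REST port."""
    url = f"http://{host}:{port}/{index}/_search"
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("localhost", "articles", {"query": {"match_all": {}}})
print(req.get_method(), req.full_url)
# POST http://localhost:9200/articles/_search

# To send it to a running node: urllib.request.urlopen(req)
```

Client applications only ever talk to port 9200 in this way; port 9300 is used by the nodes themselves.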

Elasticsearch Shards
Shards are the basic units into which an index is divided, and they are what allows Elasticsearch to scale.
An index can hold a practically unlimited number of documents. However, an index that outgrows the storage of the server hosting it can no longer be held on that server. Sharding, or dividing an index into smaller parts, solves this problem, and spreading operations across shards also improves overall performance. You choose the number of primary shards when you create the index; changing it afterwards requires creating a new index, for example by reindexing. Every shard functions as a separate Lucene index that can be hosted anywhere in the cluster.
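
How a document ends up on a particular shard can be sketched as a hash of a routing value taken modulo the number of primary shards. Elasticsearch uses its own hash function on the routing value (the document ID by default); the CRC32 hash, the function name, and the IDs here are stand-ins chosen only to keep the example self-contained.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Map a routing value to a shard number in [0, num_primary_shards)."""
    # CRC32 is a stand-in for Elasticsearch's real hash function.
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards


doc_ids = ["user-1", "user-2", "user-3", "user-4"]
with_three = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
with_five = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
print(with_three)
print(with_five)
```

The same ID always maps to the same shard for a given shard count, which is how a document can be found again without searching every shard. Changing the shard count changes the result of the modulo, so existing documents would no longer be where the formula expects them.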

Elasticsearch Replicas
Replicas in Elasticsearch are copies of an index's shards. They act as a fail-safe: if the node holding a primary shard fails, a replica can take over. A replica is never placed on the same node as its primary (original) shard, so copies are spread across nodes to ensure availability. The number of replicas can be set and changed at any time after the index is created, and an index can have more replica shards than primary shards.
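
The placement rule can be illustrated with a small round-robin allocator. This is only a sketch of the constraint that a replica never shares a node with its primary; Elasticsearch's real allocator weighs many more factors, and when there are too few nodes it leaves replicas unassigned rather than raising an error. The node names and function names are hypothetical.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Place shard copies round-robin so no two copies of a shard share a node."""
    if num_replicas + 1 > len(nodes):
        # Simplification: the real cluster would leave the extra replicas unassigned.
        raise ValueError("need at least num_replicas + 1 nodes")
    placement = {}
    for shard in range(num_primaries):
        # Consecutive offsets below len(nodes) always pick distinct nodes.
        copies = [nodes[(shard + offset) % len(nodes)]
                  for offset in range(num_replicas + 1)]
        placement[shard] = {"primary": copies[0], "replicas": copies[1:]}
    return placement


def shards_still_available(placement, failed_node):
    """Return the shards that keep at least one copy after `failed_node` is lost."""
    available = []
    for shard, copies in placement.items():
        holders = [copies["primary"]] + copies["replicas"]
        if any(node != failed_node for node in holders):
            available.append(shard)
    return available


nodes = ["node-a", "node-b", "node-c"]
placement = allocate(num_primaries=2, num_replicas=2, nodes=nodes)
print(placement[0])  # {'primary': 'node-a', 'replicas': ['node-b', 'node-c']}
print(shards_still_available(placement, "node-a"))  # [0, 1]
```

With two primaries and two replicas each, the index has four replica shards for two primary shards, and losing any single node leaves every shard available.
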
Index
An index is a container for data, similar to a database in a relational database system. It contains a collection of documents that have similar characteristics or are logically related. An e-commerce website, for example, might have separate indexes for customers, items, and so on.
We can create as many indexes as necessary inside a single cluster, depending on our needs.
Elasticsearch searches an index rather than the text directly, which is what makes its results fast. The index at the back of a book works the same way: instead of reading every word on every page, you look up a term in the index to find the pages that mention it.
This form of index is called an "inverted index" because it inverts a page-centric data structure (pages->words) into a word-centric one (words->pages). Elasticsearch uses inverted indexes, which are built and maintained by Apache Lucene.
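
The words->pages structure is simple to sketch in Python. This is a minimal illustration of the idea, not Lucene's implementation: it splits text on whitespace and lowercases it, whereas a real analyzer also handles punctuation, stemming, and synonyms. The sample documents and function names are made up.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene builds an inverted index",
    3: "An index maps terms to documents",
}
index = build_inverted_index(docs)
print(sorted(search(index, "lucene")))          # [1, 2]
print(sorted(search(index, "inverted index")))  # [2]
```

A lookup touches only the entries for the query terms, so its cost depends on how many documents contain those terms rather than on the total amount of text stored.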