Diksha Sharma

Posted on May 22

I Finally Understood Elasticsearch After Thinking About Libraries

#tutorial #beginners #learning #database

Imagine Elasticsearch as a huge digital library system, and Apache Lucene as the high-performance search engine library working behind the scenes. Elasticsearch is built on top of Lucene to provide distributed storage and extremely fast searching capabilities.

A library contains different corners or sections based on genres like cybersecurity, history, fiction, science, etc. Similarly, Elasticsearch contains indexes, where each index stores a collection of similar types of data.

Inside those sections, shelves contain books. In Elasticsearch, indexes contain documents, which are the actual units of stored data.

Now imagine a librarian helping visitors search for books. That librarian is similar to a node in Elasticsearch.

Technically, a node is a single running instance of Elasticsearch (usually one per server) that:

stores data
processes requests
searches data
communicates with other nodes

Logically:

Library system → Elasticsearch Cluster
Genre section → Index
Book → Document
Librarian/server → Node

Now imagine the library becomes extremely large and suddenly 50 visitors arrive at the same time. If only one librarian is responsible for searching every book requested by all 50 visitors, the process becomes very slow and inefficient.

To solve this problem, the library divides the books into smaller portions and distributes them across multiple librarians. In Elasticsearch, this concept is called sharding.

A shard is a smaller partition of an index. Instead of storing the entire index on one node:

Elasticsearch splits the data into shards
distributes those shards across multiple nodes
allows searches to happen in parallel

This improves:

performance
scalability
speed
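To make the idea concrete, here is a minimal sketch of how documents could be spread across shards. It assumes the general principle that a document's shard is chosen by hashing a routing value (the document ID by default) and taking the result modulo the number of primary shards. The CRC32 hash, the function names, and the "book-N" IDs are stand-ins chosen for illustration, not what Elasticsearch uses internally.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a shard number in [0, num_primary_shards)."""
    # Stand-in hash for illustration; Elasticsearch has its own routing hash.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard they are routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


# Hypothetical documents: twelve "books" spread over three shards.
doc_ids = [f"book-{i}" for i in range(12)]
shards = distribute(doc_ids, num_primary_shards=3)
for shard_number, ids in shards.items():
    print(shard_number, ids)
```

Because the same ID always hashes to the same shard, any node can work out where a document lives without asking around.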

Elasticsearch also creates replica shards, which are copies of primary shards. Replica shards help with:

fault tolerance
high availability
faster searching (replicas can also answer search requests)

So, shards can be distributed across multiple nodes, and all nodes work together as part of a cluster.
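The fault-tolerance benefit can be sketched with a toy allocation. The one rule taken from Elasticsearch is that a replica is never placed on the same node as its primary. The round-robin placement, the node names, and the helper functions are assumptions made up for this example; the real allocator weighs many more factors.

```python
def allocate(num_primary_shards, nodes):
    """Place each primary and one replica on different nodes (simple round-robin)."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to keep a replica apart from its primary")
    layout = {}
    for shard in range(num_primary_shards):
        layout[shard] = {
            "primary": nodes[shard % len(nodes)],
            "replica": nodes[(shard + 1) % len(nodes)],
        }
    return layout


def surviving_copies(layout, failed_node):
    """For each shard, list the nodes that still hold a copy after one node fails."""
    return {
        shard: [node for node in copies.values() if node != failed_node]
        for shard, copies in layout.items()
    }


# Hypothetical three-node cluster holding three primary shards.
nodes = ["node-a", "node-b", "node-c"]
layout = allocate(3, nodes)
print(layout)
print(surviving_copies(layout, failed_node="node-a"))
```

Whichever single node fails in this layout, every shard still has one copy left on another node.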

If the node that receives a search request does not hold all the relevant shards itself, it forwards the request internally to the nodes that do and collects their results. This node-to-node communication happens through the transport interface.

Each Elasticsearch node exposes two network interfaces:

1) HTTP Interface (Port 9200)

Used by:

clients
applications
Postman
curl
Kibana

to interact with Elasticsearch.
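As a rough illustration of what those clients send, the sketch below builds (but does not send) a search request against the HTTP interface using only the Python standard library. The host `localhost`, the index name `books`, and the field `title` are assumptions for the example; the `_search` endpoint and the `match` query are part of the Elasticsearch REST API.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build an HTTP search request for one index; nothing is sent here."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index and field names, default HTTP port 9200.
request = build_search_request("http://localhost:9200", "books", "title", "cybersecurity")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually send it, which only works if a node is listening at that address.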

When a client sends a request:

the node receiving the request becomes the coordinating node
this node manages and routes the request internally

The coordinating node checks:

which shard contains the required data
which node contains that shard

Then the request is forwarded to those nodes through the transport layer, which uses TCP on port 9300 by default.

Each node searches its own shards in parallel and returns its results to the coordinating node, which merges the responses and sends the final result to the client through the HTTP interface (port 9200 by default).

Important:
Any node in Elasticsearch can act as a coordinating node.
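The scatter-and-gather flow described above can be simulated in a few lines. This is only a toy model: the shard contents, the scores, and the function names are invented for illustration, and threads stand in for separate nodes.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster state: each shard already holds scored hits for one query.
SHARDS = {
    0: [("doc1", 0.9), ("doc4", 0.2)],
    1: [("doc7", 0.7)],
    2: [("doc3", 0.8), ("doc5", 0.1)],
}


def search_shard(shard_id, top_k):
    """Each shard returns only its own best hits, highest score first."""
    return heapq.nlargest(top_k, SHARDS[shard_id], key=lambda hit: hit[1])


def coordinate(top_k):
    """Play the coordinating node: fan out to every shard, then merge."""
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(lambda s: search_shard(s, top_k), SHARDS))
    merged = [hit for hits in partial_results for hit in hits]
    return heapq.nlargest(top_k, merged, key=lambda hit: hit[1])


print(coordinate(top_k=3))
```

Each shard only has to return its own top hits, so the coordinating node merges a handful of short lists instead of scanning everything itself.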

2) Transport Interface (Port 9300)

Used internally by Elasticsearch nodes for:

node-to-node communication
shard coordination
replication
cluster communication
remote cluster communication

This communication happens using a high-performance binary TCP protocol.

Each interface also has two related addresses that are easy to confuse: the binding (bind) address and the publish address.

Binding Address

Defines where Elasticsearch listens for incoming traffic.

In simple words:

“Which IP + port should Elasticsearch accept connections on?”

When Elasticsearch starts, it tells the operating system:

“Send incoming traffic for this IP and port to me.”

Publish Address

Defines the address Elasticsearch shares with other nodes and clients.

In simple words:

“Which address should other nodes use to communicate with me?”
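The difference is easier to see with a plain TCP socket. In this sketch the process listens on one address but advertises another, which is the same split Elasticsearch exposes through settings such as `network.bind_host` and `network.publish_host`. The hostname `node-1.example.internal` and the function name are hypothetical, and port 0 simply asks the operating system for any free port.

```python
import socket


def start_node(bind_host, publish_host):
    """Listen on bind_host, but report publish_host as the address peers should use."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((bind_host, 0))  # port 0: let the OS pick a free port
    server.listen()
    port = server.getsockname()[1]
    return server, (bind_host, port), (publish_host, port)


# Hypothetical setup: listen on loopback, advertise an internal hostname.
server, bind_address, publish_address = start_node("127.0.0.1", "node-1.example.internal")
print("listening on:", bind_address)
print("advertising: ", publish_address)
server.close()
```

The two usually differ when a node sits behind NAT, a proxy, or a container network, where the address it listens on is not the one other machines can reach.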

Now, how is Elasticsearch optimized for extremely fast searching?

Suppose you search:

"I love cybersecurity"

Elasticsearch does not scan every document one by one.

Instead, when new data is stored, Elasticsearch:

breaks text into smaller tokens/words
creates an inverted index

An inverted index stores mappings like:

Word            Documents
love            Doc1, Doc7
cybersecurity   Doc1, Doc3
elasticsearch   Doc2, Doc5

So instead of checking every document during a search, Elasticsearch directly jumps to the documents containing the required tokens.
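Here is a minimal sketch of that lookup in plain Python. The tokenizer (lowercase, split on anything that is not a letter or digit) is a simplification of what a real analyzer does, the sample documents are made up, and this toy search requires every query word to match, which is a design choice for the example rather than Elasticsearch's default behaviour.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token in the query."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents, loosely following the table above.
docs = {
    "Doc1": "I love cybersecurity",
    "Doc2": "Elasticsearch is built on Lucene",
    "Doc3": "Cybersecurity basics",
    "Doc7": "I love libraries",
}
index = build_inverted_index(docs)
print(sorted(search(index, "love cybersecurity")))  # ['Doc1']
```

The search never reads the document text: it looks up two small sets and intersects them.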

This is one of the major reasons Elasticsearch searches stay fast even across massive amounts of data. Sharding, described earlier, is what lets that speed scale across many nodes.

DEV Community