DEV Community

Diksha Sharma


I Finally Understood Elasticsearch After Thinking About Libraries

Imagine Elasticsearch as a huge digital library system, and Apache Lucene as the high-performance search engine working behind the scenes. Elasticsearch is built on top of Lucene and adds distributed storage and fast searching across many machines.

A library contains different corners or sections based on genres like cybersecurity, history, fiction, science, etc. Similarly, Elasticsearch contains indexes, where each index stores a collection of similar types of data.

Inside those sections, shelves contain books. In Elasticsearch, indexes contain documents, which are the actual units of stored data.

Now imagine a librarian helping visitors search for books. That librarian is similar to a node in Elasticsearch.

Technically, a node is a single running instance of Elasticsearch (usually one per server) that:

  • stores data
  • processes requests
  • searches data
  • communicates with other nodes

Logically:

  • Library system → Elasticsearch Cluster
  • Genre section → Index
  • Book → Document
  • Librarian/server → Node

Now imagine the library becomes extremely large and suddenly 50 visitors arrive at the same time. If only one librarian is responsible for searching every book requested by all 50 visitors, the process becomes very slow and inefficient.

To solve this problem, the library divides the books into smaller portions and distributes them across multiple librarians. In Elasticsearch, this concept is called sharding.

A shard is a smaller partition of an index. Instead of storing the entire index on one node:

  • Elasticsearch splits the data into shards
  • distributes those shards across multiple nodes
  • allows searches to happen in parallel

This improves:

  • search performance, because shards are searched in parallel
  • scalability, because more nodes can be added as the data grows
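
To make the idea of splitting an index concrete, here is a minimal sketch of how a document could be assigned to a shard. Elasticsearch picks the shard by hashing a routing value (the document ID by default). The sketch below only imitates that idea: `crc32` is a stand-in hash chosen because it is stable across runs, and the function names and document IDs are made up for illustration.

```python
import zlib


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key (here: the document ID) to a primary shard number."""
    # crc32 is a stand-in for the real hash function; unlike Python's built-in
    # hash(), it returns the same value on every run.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard they would be stored in."""
    shards = {number: [] for number in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    # Hypothetical document IDs spread over 3 primary shards.
    placement = distribute([f"book-{i}" for i in range(10)], num_primary_shards=3)
    for shard_number, ids in placement.items():
        print(shard_number, ids)
```

Because the shard is computed from the key, the same document always lands in the same shard, which is what lets a node later work out where to look for it.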

Elasticsearch also creates replica shards, which are copies of primary shards. Replica shards help with:

  • fault tolerance
  • high availability
  • higher search throughput, because replicas can also answer search requests

So, shards can be distributed across multiple nodes, and all nodes work together as part of a cluster.
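
Here is a small sketch of why replicas give fault tolerance. It places each primary shard and one replica on different nodes in a simple round-robin fashion, then checks which shards survive when a node fails. The round-robin rule, the node names, and the `P0`/`R0` labels are assumptions for illustration; Elasticsearch's real allocation logic weighs many more factors.

```python
def allocate(num_shards, nodes):
    """Place each primary (P) and one replica (R) of every shard on different nodes."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to hold a replica")
    layout = {node: [] for node in nodes}
    for shard in range(num_shards):
        primary_node = nodes[shard % len(nodes)]
        # The next node in the list is always different from the primary's node.
        replica_node = nodes[(shard + 1) % len(nodes)]
        layout[primary_node].append(f"P{shard}")
        layout[replica_node].append(f"R{shard}")
    return layout


def surviving_shards(layout, failed_node):
    """Shard numbers that still have at least one copy after a node fails."""
    return {
        copy[1:]
        for node, copies in layout.items()
        if node != failed_node
        for copy in copies
    }


layout = allocate(3, ["node-a", "node-b", "node-c"])
print(layout)
print(sorted(surviving_shards(layout, "node-a")))
```

With three shards on three nodes, losing any single node still leaves one copy of every shard available.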

If the node that receives a search request does not hold all the relevant data itself, it communicates with the other nodes internally to retrieve it. This node-to-node communication happens through the transport interface.

Elasticsearch exposes two network interfaces:


1) HTTP Interface (Default Port 9200)

Used by:

  • clients
  • applications
  • Postman
  • curl
  • Kibana

to interact with Elasticsearch.
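
As a rough illustration, this is what a client request to the HTTP interface could look like from Python. The sketch only builds the request and does not send it, so it runs without a cluster. The index name `books`, the field `title`, and the unsecured `http://localhost:9200` address are assumptions; a real cluster may require HTTPS and authentication.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build (but do not send) a search request for the HTTP interface."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local node, index, and field.
request = build_search_request("http://localhost:9200", "books", "title", "cybersecurity")
print(request.get_method(), request.full_url)

# To actually send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```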

When a client sends a request:

  • the node receiving the request becomes the coordinating node
  • this node manages and routes the request internally

The coordinating node checks:

  • which shard contains the required data
  • which node contains that shard

Then, the request is forwarded to the nodes holding those shards through the transport layer, which uses TCP on port 9300 by default.

Each node searches its own shards in parallel and returns its results to the coordinating node, which merges the responses and sends the final result back to the client through the HTTP interface (port 9200 by default).

Important:
Any node in Elasticsearch can act as a coordinating node.
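
The fan-out and merge steps described above can be sketched in a few lines. This is a toy model, not how Elasticsearch is implemented: threads in a single process stand in for nodes talking over the transport layer, and the shard contents and scores in `SHARDS` are invented for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster state: each shard holds (doc_id, score) hits for one query.
SHARDS = {
    "shard-0": [("Doc1", 2.4), ("Doc7", 0.9)],
    "shard-1": [("Doc3", 1.7)],
    "shard-2": [("Doc5", 0.3), ("Doc2", 1.1)],
}


def search_shard(shard_name, size):
    """Runs on the node holding the shard: return that shard's best local hits."""
    return heapq.nlargest(size, SHARDS[shard_name], key=lambda hit: hit[1])


def coordinate(size):
    """Runs on the coordinating node: fan out to every shard, then merge."""
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(lambda name: search_shard(name, size), SHARDS))
    all_hits = [hit for hits in partial_results for hit in hits]
    # Keep only the overall best hits across all shards.
    return heapq.nlargest(size, all_hits, key=lambda hit: hit[1])


print(coordinate(size=3))
```

Each shard only returns its own top hits, so the coordinating node merges a handful of small lists instead of looking at every document.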


2) Transport Interface (Default Port 9300)

Used internally by Elasticsearch nodes for:

  • node-to-node communication
  • shard coordination
  • replication
  • cluster communication
  • remote cluster communication

This communication happens using a high-performance binary TCP protocol.


Two related concepts are also important here: the bind address and the publish address.

Bind Address

Defines where Elasticsearch listens for incoming traffic.

In simple words:

“Which IP + port should Elasticsearch accept connections on?”

When Elasticsearch starts, it tells the operating system:

“Send incoming traffic for this IP and port to me.”


Publish Address

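Defines the address Elasticsearch advertises to other nodes and clients. It can differ from the bind address, for example when the node runs inside a container or behind NAT.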

In simple words:

“Which address should other nodes use to communicate with me?”
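
The difference between the two addresses can be shown with a plain TCP socket. This is only a sketch of the general networking idea, not Elasticsearch code: the function name and the host `es-node-1.example.internal` are made up. In Elasticsearch itself, settings such as `network.bind_host` and `network.publish_host` play these two roles.

```python
import socket


def start_listener(bind_host, bind_port, publish_host):
    """Open a listening TCP socket; return it with its bind and publish addresses."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # bind(): ask the operating system to deliver traffic for this IP and port to us.
    server.bind((bind_host, bind_port))
    server.listen()
    bound_host, bound_port = server.getsockname()
    bind_address = f"{bound_host}:{bound_port}"
    # The publish address is only advertised to others; it is not bound locally.
    publish_address = f"{publish_host}:{bound_port}"
    return server, bind_address, publish_address


# Port 0 lets the OS pick any free port, so the sketch runs anywhere.
server, bind_address, publish_address = start_listener(
    "127.0.0.1", 0, "es-node-1.example.internal"
)
print("listening on:", bind_address)
print("advertised as:", publish_address)
server.close()
```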


Now, how is Elasticsearch optimized for extremely fast searching?

Suppose you search:

"I love cybersecurity"

Elasticsearch does not scan every document one by one.

Instead, when new data is stored, Elasticsearch:

  • breaks text into smaller tokens/words
  • creates an inverted index

An inverted index stores mappings like:

Word           | Documents
love           | Doc1, Doc7
cybersecurity  | Doc1, Doc3
elasticsearch  | Doc2, Doc5

So instead of checking every document during a search, Elasticsearch directly jumps to the documents containing the required tokens.
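
A tiny inverted index is easy to build by hand, which helps show why lookups are fast. The sketch below is a simplification: the tokenizer just lowercases and splits on whitespace, the search requires every query token to be present, and the document texts are invented. Elasticsearch's real analysis and relevance ranking are far more involved.

```python
from collections import defaultdict


def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map every token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every token of the query."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents for illustration.
documents = {
    "Doc1": "I love cybersecurity",
    "Doc3": "Cybersecurity basics",
    "Doc7": "We love libraries",
}
index = build_inverted_index(documents)
print(sorted(index["love"]))
print(sorted(search(index, "love cybersecurity")))
```

The search never touches the document texts. It only looks up two small sets and intersects them.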

This is one of the major reasons why Elasticsearch is extremely fast and scalable even when handling massive amounts of data.
