Sivaramakrishnan Janakiraman

Elasticsearch Shard Imbalance: The Hidden Limitation

To truly understand Elasticsearch or OpenSearch, you must understand shards. They dictate your cluster’s performance, reliability, scalability, and even cost. Without grasping how shards are stored and balanced, you can’t master the system.

This post builds up the fundamentals and then reveals a silent limitation that can quietly damage performance — even when shard counts look perfectly balanced.

1. How Shards Are Stored Inside an Elasticsearch/OpenSearch Cluster

Before discussing shard sizing or performance issues, it’s important to understand how an index is stored in a cluster.
The diagram below illustrates this clearly:

1.1 Index → Logical Grouping of Documents

An index such as My_Index is simply a logical grouping of related documents. Internally, Elasticsearch never stores an index as a single block of data. Instead, it splits the index into multiple primary shards, which act as the actual physical storage units inside the cluster.

1.2 Primary Shards → The Authoritative Store

Primary shards hold the authoritative copy of your data. When a document is indexed, Elasticsearch uses a routing hash—based on the document’s _id unless a custom routing value is provided—to determine which shard it belongs to. The document is written to that primary shard, and future read queries can be served either by this primary shard or by any of its replica shards.
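
At a high level, the routing step is just "hash the routing value, take it modulo the number of primary shards." The sketch below illustrates that idea in Python; it is a simplification (the real implementation uses a Murmur3 hash plus a routing factor, not MD5), so treat it as a mental model rather than the exact production formula.

```python
# Simplified model of document-to-shard routing.
# Assumption: a plain "hash modulo primary shard count"; the real cluster
# uses Murmur3 and a routing factor, but the behaviour is the same idea.
import hashlib

def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    # Stand-in hash (MD5) just for illustration.
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % number_of_primary_shards

# The same _id (or custom routing value) always maps to the same primary
# shard, which is also why the primary shard count of an index is fixed
# at creation time.
print(pick_shard("doc-42", 3))  # prints 0, 1, or 2
```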

1.3 Replica Shards → High Availability & Parallel Reads

Replica shards are full copies of primary shards, stored on different data nodes to ensure availability and improve read performance. If the data node hosting a primary shard fails, its corresponding replica is immediately promoted to primary. Because both primaries and replicas can serve search requests, replica shards also increase read throughput and reduce query latency through parallelism.
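
The number of replicas is an index setting and, unlike the primary shard count, it can be changed at any time. Below is a minimal sketch that creates an index with three primaries and one replica per primary over the REST API using Python's requests library; the host and index name are placeholders.

```python
# Minimal sketch: create an index with 3 primary shards, each with 1 replica.
# "http://localhost:9200" and "my_index" are placeholder values.
import requests

resp = requests.put(
    "http://localhost:9200/my_index",
    json={
        "settings": {
            "number_of_shards": 3,    # fixed once the index is created
            "number_of_replicas": 1,  # can be raised or lowered later
        }
    },
)
print(resp.json())
```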

1.4 Data Nodes → Execution & Storage Engine

Data nodes are the workhorses of the cluster. They store both primary and replica shards, execute search queries, index new documents, perform segment merges, and manage the underlying Lucene structures. Each data node holds a mixture of shards, as illustrated in the diagram, distributing the cluster’s workload across the available hardware.
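
To see this placement for yourself, the _cat/shards API lists every primary (p) and replica (r) shard together with the node it lives on. A quick sketch, with the host as a placeholder:

```python
# Sketch: list which primary/replica shards sit on which data node.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/shards",
    params={"v": "true", "h": "index,shard,prirep,store,node"},
)
print(resp.text)
```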

1.5 Master Node (Cluster-Manager Node) → The Coordinator

The master node—called the cluster-manager node in newer versions—coordinates the overall cluster. It maintains metadata, controls shard allocation, tracks node membership, promotes replicas to primaries during failures, and manages rebalancing operations. Unlike data nodes, it does not store documents; its role is to orchestrate the cluster’s behavior and ensure consistency.
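
If you want to check which node currently holds this role, Elasticsearch exposes it via _cat/master, and recent OpenSearch releases also provide _cat/cluster_manager for the same information. A quick sketch (host is a placeholder):

```python
# Sketch: show the currently elected master / cluster-manager node.
import requests

print(requests.get("http://localhost:9200/_cat/master", params={"v": "true"}).text)
# On newer OpenSearch versions, /_cat/cluster_manager returns the same data.
```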

2. Shard Sizing and the Impact of Over-Sharding vs Under-Sharding

2.1 Ideal Shard Size (~30–50 GB Recommended)

A widely accepted guideline in both the Elasticsearch and OpenSearch communities is to keep shards in the range of 30–50 GB. This size is considered the sweet spot because it balances performance, memory usage, and recovery time. Large shards tend to slow down searches, increase heap pressure, and take significantly longer to recover when a node fails. They also create a larger “blast radius”: if a 500 GB shard becomes unavailable, the cluster must recover a massive amount of data at once.

On the other hand, very small shards create a different set of problems. They introduce excess metadata, increase the number of open file handles, consume additional memory, and place unnecessary pressure on the cluster-manager node. Beyond these structural overheads, tiny shards also slow down search queries at the data-node level. Each data node has a fixed pool of search threads, and a single thread can operate on only one shard at a time. When a search request needs to scan many tiny shards on the same node, these shards must be processed sequentially within each thread. This increases per-query latency, reduces parallelism efficiency, and, when multiplied across many indices, makes the entire cluster sluggish.

Getting shard size right is one of the most impactful design decisions you can make, because it directly influences performance, fault tolerance, and operational cost.
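
A simple back-of-envelope calculation is often enough to pick a starting shard count: divide the projected index size by a target shard size inside the guideline above. The numbers in the sketch below (600 GB of data, 40 GB target shards) are illustrative assumptions, not fixed rules.

```python
# Back-of-envelope sketch for choosing a primary shard count.
import math

expected_index_size_gb = 600  # assumed projected size of the primaries
target_shard_size_gb = 40     # inside the ~30-50 GB guideline

primary_shards = math.ceil(expected_index_size_gb / target_shard_size_gb)
print(primary_shards)  # 15 primaries of roughly 40 GB each
```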

2.2 Over-Sharding vs Under-Sharding

Shard sizing issues usually show up in two forms: under-sharding and over-sharding.
Under-sharding happens when an index is created with too few shards, forcing each shard to hold a very large portion of the total data. This reduces parallelism across the cluster and makes heavy shards difficult to move or recover, limiting the system’s ability to balance load effectively.

Over-sharding occurs when an index is split into too many small shards. While it may look harmless, each shard behaves as a separate Lucene index and introduces its own resource cost.

Both extremes can degrade cluster performance in different ways, which is why choosing the correct shard count—aligned with expected data volume and workload patterns—is essential for predictable, healthy operation.
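
When an existing index turns out to be over-sharded, one remedial option is the _shrink API, which rewrites the index into fewer primary shards. The sketch below is illustrative: the index names and host are placeholders, the target shard count must be a factor of the source count, and before shrinking the index has to be made read-only and have a copy of every shard allocated onto a single node.

```python
# Sketch: shrink an over-sharded index from 10 primaries down to 2.
# "logs_oversharded", "logs_rightsized", and the host are placeholders.
import requests

base = "http://localhost:9200"

# 1. Block writes so the source shards stop changing.
requests.put(
    f"{base}/logs_oversharded/_settings",
    json={"settings": {"index.blocks.write": True}},
)

# 2. Shrink into a new index with fewer, larger primary shards.
resp = requests.post(
    f"{base}/logs_oversharded/_shrink/logs_rightsized",
    json={"settings": {"index.number_of_shards": 2}},
)
print(resp.json())
```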

3. How Elasticsearch/OpenSearch Balances Shards

Elasticsearch and OpenSearch distribute shards across data nodes using a balancing strategy that focuses on shard count per node, not on the actual size of each shard. In simple terms, if your cluster has 20 shards and 5 data nodes, the allocator aims for roughly four shards on every node, regardless of whether those shards are tiny or hundreds of gigabytes. This approach keeps the cluster numerically balanced and easy to manage at scale, but it also introduces a subtle weakness: a cluster can look perfectly balanced in shard count while being deeply unbalanced in real workload.
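
You can see this count-oriented view directly in the _cat/allocation API, which reports both the shard count and the disk those shards actually consume on each node. A quick sketch (host is a placeholder):

```python
# Sketch: compare shard counts with actual disk usage per data node.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/allocation",
    params={"v": "true", "h": "node,shards,disk.indices,disk.used,disk.percent"},
)
print(resp.text)
```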

4. The Hidden Limitation: Balanced Shard Count ≠ Balanced Workload

To see why shard-count balancing can be misleading, consider a common real-world pattern. Suppose Index A is under-sharded, with only two shards holding about 100 GB each, while Index B is over-sharded, split into ten shards of roughly 1 GB each. In a cluster with four data nodes, Elasticsearch/OpenSearch will try to distribute the total twelve shards so that each node holds about three shards. On paper, this seems ideal. However, once the shards land, two nodes end up hosting the massive 100 GB shards from Index A, while the remaining nodes host only tiny shards from Index B. The visual below makes this imbalance obvious:

Although every node holds the same number of shards, the true load is uneven. Nodes carrying the large shards experience heavier CPU usage, higher heap pressure, more intense merge activity, and greater GC stress, while nodes with only tiny shards remain relatively idle. The result is that query latency becomes inconsistent, recovery after failures takes longer, and the large-shard index turns into a hotspot that drags down overall cluster performance. This is the core trap: numerical balance does not guarantee workload balance, and engineers must design shard counts and sizes with this limitation in mind.
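
A practical way to catch this trap early is to aggregate shard sizes per node rather than counting shards. The sketch below sums the on-disk store size reported by _cat/shards for each data node; the host is a placeholder, and unassigned shards are skipped.

```python
# Sketch: total on-disk shard size per data node, to expose size imbalance
# even when every node holds the same number of shards.
from collections import defaultdict
import requests

resp = requests.get(
    "http://localhost:9200/_cat/shards",
    params={"format": "json", "bytes": "b", "h": "index,shard,prirep,store,node"},
)

per_node_bytes = defaultdict(int)
for shard in resp.json():
    if shard.get("store") and shard.get("node"):  # skip unassigned shards
        per_node_bytes[shard["node"]] += int(shard["store"])

for node, total in sorted(per_node_bytes.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {total / 1024**3:.1f} GiB")
```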

5. Conclusion

Shards define everything about your Elasticsearch/OpenSearch cluster — performance, reliability, cost, and long-term stability.
By understanding how shards are stored and — critically — how the cluster balances shard counts but not sizes, you can avoid silent performance killers.
