Elasticsearch Cluster Health 101: Understanding, Monitoring, and Maintaining Your Cluster
Author: Prithvi S, Staff Software Engineer at Cloudera and Open‑source Enthusiast
Introduction
You ship your Elasticsearch cluster to production. Traffic spikes. Suddenly your dashboard flashes YELLOW. What does that mean? Are you about to lose data? Can you keep the service running? This guide teaches you how to read your cluster’s health signals, diagnose problems, and keep your data safe. It focuses on the architecture and coordination aspects of Elasticsearch, not on the low‑level search mechanics.
What Is Cluster Health?
Cluster health is the collective state of all nodes, shards, and data replication in an Elasticsearch cluster. Elasticsearch reports three health levels:
- GREEN – All primary and replica shards are allocated and active.
- YELLOW – All primary shards are active, but some replica shards are missing.
- RED – One or more primary shards are unassigned, meaning some data is not searchable.
These states tell you whether the cluster can serve read requests, whether it has redundancy, and whether it can recover from failures.
Under the Hood – Cluster Coordination
Master‑eligible Nodes
The master‑eligible nodes maintain the cluster state – a versioned record of shard locations, node metadata, and index settings. Only master‑eligible nodes can become the master node that orchestrates shard allocation and rebalancing.
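You can inspect this state yourself. A quick sketch using `filter_path` to trim the response down to node names and the routing table (the full cluster state can be very large):
GET /_cluster/state?filter_path=cluster_name,nodes.*.name,routing_table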
Master Election
Elasticsearch uses a quorum‑based voting algorithm. A master is elected when a majority of master‑eligible nodes agree on the same node. This prevents split‑brain scenarios where two partitions each think they own the master.
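To see which node currently holds the master role, the cat API has a one‑liner:
GET /_cat/master?v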
Discovery
When a node starts, it contacts the configured discovery.seed_hosts (or uses the default unicast list) to locate other nodes. Once a master is found, the node receives the latest cluster state and registers itself.
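A minimal sketch of the discovery settings in `elasticsearch.yml`, assuming three illustrative hostnames (`node-a`, `node-b`, `node-c` are placeholders; adjust to your environment):
# elasticsearch.yml – hostnames are placeholders
cluster.name: my-cluster
discovery.seed_hosts: ["node-a", "node-b", "node-c"]
# Only consulted when bootstrapping a brand-new cluster:
cluster.initial_master_nodes: ["node-a", "node-b", "node-c"]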
Example Topology
Node A – master‑eligible
Node B – master‑eligible
Node C – master‑eligible
Node D – data
Node E – data
Node F – coordinating (client) – no data
The three master‑eligible nodes form a quorum. If the elected master fails, the remaining two still form a majority and can elect a new master without interruption.
Shard Allocation & Rebalancing
Primary and Replica Shards
- Primary shard – receives write operations first.
- Replica shard – a copy of a primary that provides redundancy and additional query capacity; writes made to the primary are replicated to it.
Allocation Rules
Elasticsearch tries to spread shards across nodes to avoid a single point of failure. By default, it never places two copies of the same shard (a primary and its replica) on the same node.
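If the default rule is not strict enough, you can also cap how many shards of one index land on a single node. A hedged sketch, assuming a hypothetical index `my-index`:
PUT /my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}
Set this cautiously: too low a cap can itself leave shards unassigned.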
Unassigned Shards
A shard becomes unassigned when there is no node that satisfies its allocation rules. Common reasons:
- The node that held the shard is down.
- Disk usage on all nodes exceeds the disk watermark limits (`cluster.routing.allocation.disk.watermark.high` and related settings).
- Allocation is manually disabled via settings.
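The allocation explain API (covered again below) tells you exactly which rule is blocking a shard. A sketch, assuming shard 0 of a hypothetical index `my-index`:
GET /_cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}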
Rebalancing
When a node fails, the master promotes a replica to primary and creates a new replica on another node. This process is asynchronous; it may take seconds to minutes depending on shard size and network bandwidth.
Allocation Awareness
You can tag nodes with attributes like rack, zone, or custom labels. Allocation awareness (`cluster.routing.allocation.awareness.attributes`) ensures that copies of the same shard are spread across these domains, improving fault tolerance.
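Two pieces are involved: each node advertises an attribute (for example `node.attr.zone: us-east-1a` in its `elasticsearch.yml`; the attribute name and value here are illustrative), and the cluster is told to balance across it:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}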
Reading the Health Endpoint
The simplest way to check cluster health is the GET /_cluster/health API.
GET /_cluster/health
{
"cluster_name": "my‑cluster",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 6,
"number_of_data_nodes": 4,
"active_primary_shards": 120,
"active_shards": 240,
"relocating_shards": 2,
"initializing_shards": 0,
"unassigned_shards": 4,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 3,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 96.0
}
Key fields:
- `status`: overall health (`green`, `yellow`, `red`).
- `active_primary_shards`: number of primary shards that are active.
- `unassigned_shards`: shards that have no assigned node; a non‑zero value is a red flag when it includes primaries.
- `relocating_shards`: shards moving to a new node during rebalancing.
You can wait for a specific health level before proceeding with operations:
GET /_cluster/health?wait_for_status=green&timeout=30s
If the cluster does not reach green within the timeout, the request returns anyway with `timed_out: true` in the response.
Diagnosing Common Issues
Red Cluster – Primary Shard Unassigned
Symptoms: `status: red`, `unassigned_shards > 0`.
Root causes:
- Node failure (hardware crash, network loss).
- Disk full on all nodes, preventing allocation.
- Allocation disabled (`cluster.routing.allocation.enable: none`).
Steps to fix:
- Identify the missing primary shard:
GET /_cat/shards?v&s=state&h=index,shard,prirep,state,unassigned.reason
- Check node health and disk usage:
GET /_cat/nodes?v&h=name,ip,heap.percent,cpu,load_1m,disk.avail
- Free disk space or add a new node.
- Enable allocation if it was disabled.
- Run `_cluster/reroute` to force allocation if needed (examples below).
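Sketches of those last two steps. The first re‑enables allocation cluster‑wide; the second retries shards that exhausted their automatic allocation attempts (`retry_failed` is the relevant flag):
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}

POST /_cluster/reroute?retry_failed=true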
Yellow Cluster – Replica Shards Missing
Symptoms: `status: yellow`, `unassigned_shards > 0` while every primary is active (`active_shards_percent_as_number` below 100).
Why it’s okay to serve queries: All primary shards are active, so all data is readable. However, you lack full redundancy; a further node failure could cause data loss.
Fixes:
- Wait for automatic rebalancing; new replicas are created as nodes become available.
- Increase the number of data nodes or add capacity.
- Check for allocation filters that might be preventing replica placement.
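One special case worth knowing: a single‑node cluster can never place replicas, so it stays yellow forever. If that is deliberate (say, a dev box), dropping the replica count clears the warning. A sketch for a hypothetical index `my-index`:
PUT /my-index/_settings
{
  "index": { "number_of_replicas": 0 }
}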
High CPU / Memory with Green Status
Observation: Cluster reports green but query latency is high.
Actions:
- Examine shard size distribution:
GET /_cat/shards?v&h=index,shard,prirep,store,node
- Look for hot shards concentrated on a single node.
- Review query patterns; use the `profile` API to find slow queries (example below).
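A sketch of the `profile` option, assuming a hypothetical index `my-index` with a `message` field:
GET /my-index/_search
{
  "profile": true,
  "query": {
    "match": { "message": "error" }
  }
}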
Key Diagnostic Commands
| Command | Purpose |
|---|---|
| `GET /_cluster/health` | Overall health snapshot |
| `GET /_cluster/health?wait_for_status=green&timeout=50s` | Block until the cluster is green |
| `GET /_cat/nodes?v=true&h=name,ip,heap.percent,cpu,load_1m` | Node‑level resource usage |
| `GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state` | List shards by state, including unassigned ones and why |
| `GET /_cat/indices?health=yellow` | Identify indices with unassigned replicas |
| `GET /_cat/indices?health=red` | Identify indices with missing primaries |
| `GET /_nodes/stats` | Detailed JVM, OS, and thread pool stats |
| `GET /_cluster/allocation/explain` | Explain why a particular shard is not allocated |
Monitoring & Alerting Strategy
What to Alert On
- Cluster status red – immediate pager.
- Unassigned primary shards > 0 – critical, may indicate data loss.
- Disk usage > 85% on any data node – the default low watermark; Elasticsearch stops allocating new shards to that node.
- Heap usage > 80% – risk of out‑of‑memory errors.
- Relocating shards > 10 for > 5 minutes – possible rebalancing stall.
Tools
- Kibana dashboards – visualize `cluster_health`, `node_stats`, and `shard_allocation`.
- Elastic Stack Alerting – create Watcher alerts that send Slack or email notifications.
- Prometheus + Grafana – scrape the exporter’s `/metrics` endpoint (a Prometheus rule sketch follows the Kibana sample below).
Sample Alert Rule (Kibana)
"condition": {
"script": {
"source": "params.status == 'red' || params.unassigned_shards > 0",
"params": {
"status": "{{ctx.payload.status}}",
"unassigned_shards": {{ctx.payload.unassigned_shards}}
}
}
}
When the rule fires, you receive a notification with the cluster name and a link to the health API.
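If you monitor through Prometheus instead, an equivalent rule might look like the sketch below. It assumes the community `elasticsearch_exporter`, whose `elasticsearch_cluster_health_status` gauge exposes one series per color; verify the metric names your exporter actually emits:
groups:
  - name: elasticsearch-health
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster health is RED"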
Common Mistakes & Fixes
- Single master node – no quorum; a network partition can split the cluster. Fix: Deploy at least three master‑eligible nodes.
- Ignoring yellow status – you lose redundancy. Fix: Treat yellow as a warning and add capacity or wait for rebalancing.
- All shards on one node – no fault tolerance. Fix: Use default allocation rules or explicit awareness attributes.
- Allocation disabled after a failure – shards stay unassigned. Fix: Re‑enable allocation via `PUT /_cluster/settings`.
- Oversharding – too many small shards increase cluster state size and recovery time. Fix: Aim for 10–50 GB per shard.
Conclusion
Cluster health is your safety net. Green means everything is replicated; yellow signals missing replicas; red means primary data is missing. Understanding the master election, shard allocation, and rebalancing process helps you react quickly when something goes wrong. Monitor the health API, set alerts on critical metrics, and keep an eye on disk and heap usage. With these practices you can keep your Elasticsearch deployment reliable and resilient.
[Image: Elasticsearch cluster architecture with master, data, and coordinating nodes]
[Image: Kibana dashboard showing cluster health status and shard allocation]
References
- Elasticsearch Documentation – Cluster APIs: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster.html
- Elastic Stack Monitoring Guide: https://www.elastic.co/guide/en/observability/current/monitoring.html
- “Elasticsearch Cluster Health 101” – Dev.to (previous post) – https://dev.to/iprithv/elasticsearch-cluster-health-101-understanding-your-distributed-systems-vital-signs-1kl6