Elasticsearch Cluster Health 101: Understanding Your Distributed System's Vital Signs
You ship your Elasticsearch cluster to production. Traffic hits it. Three hours later, your monitoring dashboard flashes yellow. Your heart sinks. What does that mean? Are you in trouble? Should you wake up the on-call engineer at 2 AM?
This post teaches you to read your cluster's health like a doctor reads vital signs. By the end, you'll understand what GREEN, YELLOW, and RED actually mean, why your cluster sometimes needs time to heal itself, and how to spot real problems before they become disasters.
What Is Cluster Health? The Three States
Every Elasticsearch cluster has a health status. It's not a guess. It's a concrete signal that tells you whether your data is safe.
GET /_cluster/health
{
"cluster_name": "production",
"status": "green",
"timed_out": false,
"number_of_nodes": 5,
"number_of_data_nodes": 3,
"active_primary_shards": 12,
"active_shards": 36,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100.0
}
GREEN means all is well. Every shard, primary and replica, is assigned to a healthy node. You can sleep soundly.
YELLOW means something is missing, but it's not critical yet. Every primary shard exists, so your data is still readable, writable, and searchable. But not all replicas are assigned. This usually happens when you lose a node and Elasticsearch hasn't finished rebuilding replicas yet. You have time to fix it, but replicas are your safety net: if another node fails while you're yellow, you can lose data.
RED means you're in trouble. At least one primary shard is missing. Data is gone or unreachable. Your cluster cannot fully serve requests. This is the emergency light. Time to act.
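If your monitoring ingests the health API, the three states map directly onto on-call actions. A minimal sketch (the severity wording is mine, not part of any Elasticsearch API; the status values match the _cluster/health response above):

```python
def triage(health: dict) -> str:
    """Map a _cluster/health response to an on-call action.
    The action text here is illustrative, not an official API."""
    status = health.get("status")
    if status == "red":
        return "page: at least one primary shard is unassigned"
    if status == "yellow":
        return "investigate: primaries are fine, but replicas are missing"
    if status == "green":
        return "ok: all primary and replica shards are assigned"
    return "unknown status: " + repr(status)
```

Feed it the JSON body you get back from GET /_cluster/health and route the result to your alerting tool of choice.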
The Architecture Behind Health: Cluster Coordination
To understand why your cluster gets sick, you need to understand how it stays healthy.
Elasticsearch is fundamentally distributed. Your data is split across multiple nodes. Each node is independent. But they need to agree on one critical thing: where is my data? This is the job of the cluster coordinator (master node).
The Master Node: The Orchestrator
One node in your cluster is elected master. This node makes all the big decisions:
- Where do shards live?
- Is node X still alive, or did it fail?
- When a node joins, where do its shards go?
- Which indices can be created or deleted?
The master maintains the cluster state, a constantly updated map of the cluster. This map says: "Shard 0 of index-2026-04 is on node-1 (primary), node-2 (replica), and node-3 (replica)."
Why does this matter? Because if the master dies, the cluster needs to elect a new one. And if you don't have enough nodes to reach a quorum, the cluster freezes to prevent split-brain (where two masters disagree and corrupt your data).
Master Election: The Quorum Rule
Elasticsearch uses quorum-based voting: electing a master requires a strict majority of the master-eligible nodes, that is, floor(N/2) + 1 votes.
- 1 master-eligible node: quorum = 1, works, but a single point of failure
- 2 master-eligible nodes: quorum = 2, so losing either node leaves no majority; no safer than 1
- 3 master-eligible nodes: quorum = 2 (safe, survives 1 failure)
- 5 master-eligible nodes: quorum = 3 (safe, survives 2 failures)
This is why production clusters run 3 or 5 master-eligible nodes, not 2. If one node of a 2-node cluster fails, the survivor can't form a majority: the cluster can't elect a master and stops accepting writes and cluster-state changes.
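The table above is plain majority arithmetic. A quick sketch (note that Elasticsearch 7+ manages the voting configuration for you automatically; this only illustrates the math):

```python
def quorum(master_eligible: int) -> int:
    """Votes needed to elect a master: a strict majority."""
    return master_eligible // 2 + 1

def tolerable_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can fail while a majority remains."""
    return master_eligible - quorum(master_eligible)

for n in (1, 2, 3, 5):
    print(f"{n} master-eligible: quorum={quorum(n)}, "
          f"survives {tolerable_failures(n)} failure(s)")
```

Even numbers buy you nothing: going from 3 nodes to 4 raises the quorum from 2 to 3 without increasing the number of survivable failures.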
Recommended Production Setup:
- 3 master-eligible nodes (dedicated, small machines)
- 3+ data nodes (store and search data, large machines)
- 1+ coordinating nodes (optional, route queries, aggregate results)
This setup survives any single node failure.
Shard Allocation: How Data Spreads
Your index has 3 primary shards and 2 replicas per primary. That's 9 shards total (3 primary + 6 replica). Elasticsearch's job is to spread these 9 shards across your nodes so that:
- No primary and replica on the same node (otherwise a single node failure loses data)
- Replicas spread across different nodes (fault tolerance)
- Load balanced (roughly equal shard count per node)
When everything works, this happens automatically. When a node fails, Elasticsearch:
- Detects the failure (no heartbeat for 30 seconds)
- Marks the node as dead
- Reassigns its shards to other nodes
- Creates new replicas to restore redundancy
This process is called rebalancing. It takes time. A large index might take minutes or hours to fully rebalance. During this time, your cluster is YELLOW (replicas missing), but still operational.
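The first allocation rule, that no two copies of the same shard share a node, can be checked mechanically. Here is a small sketch; the (index, shard, role, node) tuple format is invented for illustration, not an Elasticsearch API:

```python
def same_node_violations(assignments):
    """Return the (index, shard) pairs that have two copies
    (primary or replica) assigned to the same node, which is
    exactly what the allocator is designed to prevent."""
    seen = {}   # (index, shard) -> set of nodes holding a copy
    bad = set()
    for index, shard, role, node in assignments:
        key = (index, shard)
        if node in seen.setdefault(key, set()):
            bad.add(key)
        seen[key].add(node)
    return bad
```

A healthy layout such as [("logs", 0, "p", "node-1"), ("logs", 0, "r", "node-2")] yields no violations; put both copies on node-1 and shard ("logs", 0) is flagged.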
Common Health Scenarios
Scenario 1: New Node Joins
Before: 2 nodes, 6 shards each (fully replicated, GREEN)
New node joins
Action: Master rebalances, shards move to new node
During: usually still GREEN, with relocating_shards > 0 while data copies to the new node
After: GREEN (shards rebalanced across all nodes)
Timeline: Minutes to hours depending on shard size
Scenario 2: Node Failure
Before: 3 nodes, GREEN (all shards have replicas)
Node 2 crashes (network partition, power failure)
Immediately: YELLOW (node 2's shards gone, replicas missing)
Action: Master promotes replicas on nodes 1 and 3 to primary
Creates new replicas on nodes 1 and 3
During: YELLOW (replicas initializing)
After: GREEN (all shards have replicas again)
Timeline: Seconds (replica promotion) + minutes (replica creation)
Scenario 3: Disk Full
Before: 3 nodes, GREEN
Node 1 disk reaches 85% capacity
Action: Elasticsearch refuses to assign new shards to node 1
Symptom: Some shards can't be assigned to node 1, cluster goes YELLOW
Fix: Delete old indices, or add disk space
After: Cluster rebalances, goes GREEN
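The 85% figure is Elasticsearch's default low disk watermark (cluster.routing.allocation.disk.watermark.low). A rough sketch of the decision it drives, with the default thresholds hard-coded (the real allocation decider reads live cluster settings, which can also be absolute byte values):

```python
LOW_WATERMARK = 85.0   # default: stop assigning NEW shards to the node
HIGH_WATERMARK = 90.0  # default: start relocating shards off the node

def disk_allocation_decision(disk_used_percent: float) -> str:
    """Sketch of the disk-based allocation decider, defaults only."""
    if disk_used_percent >= HIGH_WATERMARK:
        return "relocate shards away from this node"
    if disk_used_percent >= LOW_WATERMARK:
        return "keep existing shards, but assign no new ones"
    return "node is eligible for new shards"
```

There is also a 95% flood-stage watermark at which Elasticsearch marks affected indices read-only, which is why you want to alert long before disks get there.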
Reading the Health Endpoint: What Each Field Means
The GET /_cluster/health API is your primary diagnostic tool. Here's what each field tells you:
| Field | Meaning |
|---|---|
| status | GREEN (all good), YELLOW (missing replicas), RED (missing primary) |
| number_of_nodes | Total nodes in the cluster |
| number_of_data_nodes | Nodes that store data |
| active_primary_shards | Primary shards assigned and healthy |
| active_shards | Primary + replica shards assigned and healthy |
| relocating_shards | Shards currently moving to another node |
| initializing_shards | Shards being created or recovered |
| unassigned_shards | Shards that haven't been assigned to a node |
| delayed_unassigned_shards | Shards waiting to be assigned (temporary delay) |
| number_of_pending_tasks | Master tasks waiting to be executed |
Example: Degraded Cluster
{
"status": "yellow",
"number_of_nodes": 3,
"active_primary_shards": 12,
"active_shards": 24,
"unassigned_shards": 12,
"relocating_shards": 2,
"initializing_shards": 4
}
Translation: 3 nodes, 12 primary shards assigned, but only 24 total shards assigned. That means 12 replicas are missing (unassigned). Also, 2 shards are moving, 4 are initializing. The cluster is rebalancing from a recent failure or node addition.
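That translation can be read straight off the numbers. A quick sketch of the arithmetic, using the sample response above:

```python
health = {
    "status": "yellow",
    "number_of_nodes": 3,
    "active_primary_shards": 12,
    "active_shards": 24,
    "unassigned_shards": 12,
    "relocating_shards": 2,
    "initializing_shards": 4,
}

# Active replicas = all active shards minus the active primaries.
active_replicas = health["active_shards"] - health["active_primary_shards"]

print(f"{active_replicas} replicas assigned, "
      f"{health['unassigned_shards']} shards still unassigned, "
      f"{health['relocating_shards']} moving, "
      f"{health['initializing_shards']} initializing")
```

Since every unassigned shard here is a replica (the status is yellow, not red, so all primaries are active), the cluster is healthy but mid-recovery.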
Diagnosing RED: Data Is Missing
A RED cluster means at least one primary shard has no home. This is an emergency.
Find the problematic shard:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
This lists every shard; look for rows in the UNASSIGNED state (the health filter only exists on /_cat/indices, not /_cat/shards). To ask Elasticsearch why a specific shard can't be allocated, use the allocation explain API: GET /_cluster/allocation/explain
Common causes:
- Node failure with insufficient replicas: if a node fails and you had zero replicas, its primary shards are lost. Fix: restore from a snapshot.
- Disk full on all nodes: Elasticsearch won't assign shards to nodes above the disk watermark (85% by default). Fix: delete old indices, add disk space, or adjust the watermark settings.
- Allocation disabled: someone (usually during maintenance or disaster recovery) disabled shard allocation and never turned it back on. Fix: re-enable with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}
- Too many relocating shards: the master is overloaded trying to rebalance. Fix: wait, or reduce concurrent recoveries via the cluster recovery settings.
Diagnosing YELLOW: Replicas Are Missing
YELLOW is a warning, not a failure. You can still read and write. But you're one node failure away from RED.
Check which indices are yellow:
GET /_cat/indices?health=yellow&v
Check if it's stuck or still rebalancing:
GET /_cluster/health?wait_for_status=green&timeout=5m
This waits up to 5 minutes for the cluster to reach GREEN. If it times out, you're stuck yellow.
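The same wait-and-check loop is easy to script for runbooks. A sketch with a pluggable fetch function so it can be pointed at GET /_cluster/health with any HTTP client (the function names are mine):

```python
import time

def wait_for_status(fetch_health, desired="green", timeout=300, interval=5):
    """Poll fetch_health() (a callable returning the parsed
    _cluster/health JSON) until the cluster reaches the desired
    status or the timeout (seconds) expires. Returns True/False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_health().get("status") == desired:
            return True
        time.sleep(interval)
    return False
```

In practice the server-side wait_for_status parameter shown above is simpler; a client-side loop like this is useful when you also want to log intermediate states or check several clusters at once.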
Why you might be stuck yellow:
- Insufficient nodes: you have 1 data node but 2 replicas per shard, so there is nowhere to put the replicas. Fix: add more nodes, or lower number_of_replicas.
- Allocation disabled: replicas won't be assigned while allocation is off. Fix: PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}
- Allocation filters blocking replicas: you set a filter that prevents replicas from landing on certain nodes. Fix: review your allocation filtering rules.
Monitoring: Don't Just React, Anticipate
Cluster health is reactive. It tells you what happened, not what will happen. For reliability, monitor proactively:
Alert on these:
- Status == RED (obvious, immediate incident)
- Status == YELLOW for >5 minutes (stuck rebalancing, investigate)
- Unassigned shards > 0 for >10 minutes
- Disk usage >85% on any data node
- Heap usage >80% on any node
- Relocating shards > 5 (recovery is slow)
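The health-based rules above translate directly into code. A sketch of an evaluator over a _cluster/health snapshot (the stuck_minutes argument, meaning how long the current state has persisted, would come from your monitoring system; the heap and disk rules need the nodes APIs instead, so they're omitted here):

```python
def health_alerts(health: dict, stuck_minutes: float) -> list:
    """Evaluate the health-based alert rules against one
    _cluster/health snapshot. Thresholds mirror the list above;
    tune them for your own cluster."""
    fired = []
    if health.get("status") == "red":
        fired.append("RED: missing primary shards, page immediately")
    if health.get("status") == "yellow" and stuck_minutes > 5:
        fired.append("stuck YELLOW for more than 5 minutes")
    if health.get("unassigned_shards", 0) > 0 and stuck_minutes > 10:
        fired.append("unassigned shards for more than 10 minutes")
    if health.get("relocating_shards", 0) > 5:
        fired.append("more than 5 relocating shards, recovery is slow")
    return fired
```

A green cluster with nothing unassigned produces an empty list; anything returned is worth a ticket at minimum.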
Useful dashboard queries:
GET /_cat/nodes?v&h=name,ip,heap.percent,disk.used_percent,cpu,load_1m
This shows node-by-node health: heap usage, disk usage, CPU, load. Red flags: heap >80%, disk >85%.
GET /_nodes/stats/jvm,fs,indices
Deep dive: garbage collection pauses, segment count, cache hit rates. Useful for performance issues hiding behind a GREEN cluster.
Common Mistakes That Destroy Reliability
Mistake 1: Single master-eligible node
- You think it works fine until that one node fails
- No master can be elected; the cluster stops accepting writes and cluster-state changes
- Fix: Always run 3+ master-eligible nodes
Mistake 2: Ignoring YELLOW for days
- "It's yellow, but traffic is fine!" you say
- Then a second node fails, cluster goes RED
- Fix: Investigate YELLOW immediately, restore replicas
Mistake 3: All shards on one node
- You didn't specify replicas or shard allocation rules
- One node failure = RED cluster
- Fix: Use allocation awareness, rack awareness, or zone awareness
Mistake 4: Disabling shard allocation and forgetting to re-enable
- You disabled it during maintenance and moved on
- Weeks later, replicas are still unassigned
- Fix: Audit allocation settings regularly
Mistake 5: Not understanding recovery time
- You expect replicas to be created instantly
- But recovery depends on network bandwidth, index size, merge rate
- You panic and manually delete/recreate indices, making it worse
- Fix: Understand recovery SLOs for your cluster size
Putting It Together: Your First Cluster Health Audit
Here's what to do right now:
# Check overall health
curl "http://localhost:9200/_cluster/health?pretty"
# If not GREEN, check which indices are affected
curl "http://localhost:9200/_cat/indices?health=yellow&v"
# Check which shards are unassigned (look for state UNASSIGNED)
curl "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
# Check node status
curl "http://localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,disk.used_percent,cpu"
# Check allocation settings
curl "http://localhost:9200/_cluster/settings?pretty"
Quote the URLs: an unquoted & would background the curl command and silently drop the rest of the query string.
If you see YELLOW and unassigned shards, it's usually one of these:
- A node is recovering (wait 5-10 min, check again)
- You don't have enough nodes for your replica count (add nodes)
- Disk is full (delete old data)
- Allocation is disabled (re-enable it)
Conclusion: Health Is Visibility
Cluster health is not a number to ignore. It's your window into the distributed system running underneath your search and analytics.
GREEN means you're safe. YELLOW means you're vulnerable. RED means you have a real problem.
The key insight: Elasticsearch recovers automatically most of the time. Your job is to understand what's happening, monitor proactively, and know when to intervene.
Next step: Learn about shard allocation strategies and how to scale your cluster without triggering cascading failures.
About the Author
I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. Follow my work on GitHub: https://github.com/iprithv