Elasticsearch Cluster Health 101: Understanding Your Distributed System's Vital Signs
You ship your Elasticsearch cluster to production. Traffic hits it. Three hours later, your monitoring dashboard flashes yellow. Your heart sinks. What does that mean? Are you in trouble? Should you wake up the on-call engineer at 2 AM?
This post teaches you to read your cluster's health like a doctor reads vital signs. By the end, you'll understand what GREEN, YELLOW, and RED actually mean, why your cluster sometimes needs time to heal itself, and how to spot real problems before they become disasters.
What Is Cluster Health? The Three States
Every Elasticsearch cluster has a health status. It's not a guess. It's a concrete signal that tells you whether your data is safe.
GET /_cluster/health
{
"cluster_name": "production",
"status": "green",
"timed_out": false,
"number_of_nodes": 5,
"number_of_data_nodes": 3,
"active_primary_shards": 12,
"active_shards": 36,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100.0
}
GREEN means all is well. Every shard, primary and replica, is assigned to a healthy node. You can sleep soundly.
YELLOW means something is missing, but it's not critical yet. Every primary shard exists, so your data is still readable, writable, and searchable. But not all replicas are assigned. This usually happens when you lose a node and Elasticsearch hasn't finished rebuilding replicas yet. You have time to fix it, but replicas are your safety net: if another node fails while you're yellow, you can lose data.
RED means you're in trouble. At least one primary shard is missing. Data is gone or unreachable. Your cluster cannot fully serve requests. This is the emergency light. Time to act.
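If your monitoring ingests the health API, the three states map directly onto on-call actions. A minimal sketch (the severity wording is mine, not part of any Elasticsearch API; the status values match the _cluster/health response above):

```python
def triage(health: dict) -> str:
    """Map a _cluster/health response to an on-call action.
    The action text here is illustrative, not an official API."""
    status = health.get("status")
    if status == "red":
        return "page: at least one primary shard is unassigned"
    if status == "yellow":
        return "investigate: primaries are fine, but replicas are missing"
    if status == "green":
        return "ok: all primary and replica shards are assigned"
    return "unknown status: " + repr(status)
```

Feed it the JSON body you get back from GET /_cluster/health and route the result to your alerting tool of choice.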
The Architecture Behind Health: Cluster Coordination
To understand why your cluster gets sick, you need to understand how it stays healthy.
Elasticsearch is fundamentally distributed. Your data is split across multiple nodes. Each node is independent. But they need to agree on one critical thing: where is my data? This is the job of the cluster coordinator (master node).
The Master Node: The Orchestrator
One node in your cluster is elected master. This node makes all the big decisions:
- Where do shards live?
- Is node X still alive, or did it fail?
- When a node joins, where do its shards go?
- Which indices can be created or deleted?
The master maintains the cluster state, a constantly updated map of the cluster. This map says: "Shard 0 of index-2026-04 is on node-1 (primary), node-2 (replica), and node-3 (replica)."
Why does this matter? Because if the master dies, the cluster needs to elect a new one. And if you don't have enough nodes to reach a quorum, the cluster freezes to prevent split-brain (where two masters disagree and corrupt your data).
Master Election: The Quorum Rule
Elasticsearch uses quorum-based voting: electing a master requires a strict majority of the master-eligible nodes, that is, floor(N/2) + 1 votes.
- 1 master-eligible node: quorum = 1, works, but a single point of failure
- 2 master-eligible nodes: quorum = 2, so losing either node leaves no majority; no safer than 1
- 3 master-eligible nodes: quorum = 2 (safe, survives 1 failure)
- 5 master-eligible nodes: quorum = 3 (safe, survives 2 failures)
This is why production clusters run 3 or 5 master-eligible nodes, not 2. If one node of a 2-node cluster fails, the survivor can't form a majority: the cluster can't elect a master and stops accepting writes and cluster-state changes.
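The table above is plain majority arithmetic. A quick sketch (note that Elasticsearch 7+ manages the voting configuration for you automatically; this only illustrates the math):

```python
def quorum(master_eligible: int) -> int:
    """Votes needed to elect a master: a strict majority."""
    return master_eligible // 2 + 1

def tolerable_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can fail while a majority remains."""
    return master_eligible - quorum(master_eligible)

for n in (1, 2, 3, 5):
    print(f"{n} master-eligible: quorum={quorum(n)}, "
          f"survives {tolerable_failures(n)} failure(s)")
```

Even numbers buy you nothing: going from 3 nodes to 4 raises the quorum from 2 to 3 without increasing the number of survivable failures.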
Recommended Production Setup:
- 3 master-eligible nodes (dedicated, small machines)
- 3+ data nodes (store and search data, large machines)
- 1+ coordinating nodes (optional, route queries, aggregate results)
This setup survives any single node failure.
Shard Allocation: How Data Spreads
Your index has 3 primary shards and 2 replicas per primary. That's 9 shards total (3 primary + 6 replica). Elasticsearch's job is to spread these 9 shards across your nodes so that:
- No primary and replica on the same node (otherwise a single node failure loses data)
- Replicas spread across different nodes (fault tolerance)
- Load balanced (roughly equal shard count per node)
When everything works, this happens automatically. When a node fails, Elasticsearch:
- Detects the failure (no heartbeat for 30 seconds)
- Marks the node as dead
- Reassigns its shards to other nodes
- Creates new replicas to restore redundancy
This process is called rebalancing. It takes time. A large index might take minutes or hours to fully rebalance. During this time, your cluster is YELLOW (replicas missing), but still operational.
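The first allocation rule, that no two copies of the same shard share a node, can be checked mechanically. Here is a small sketch; the (index, shard, role, node) tuple format is invented for illustration, not an Elasticsearch API:

```python
def same_node_violations(assignments):
    """Return the (index, shard) pairs that have two copies
    (primary or replica) assigned to the same node, which is
    exactly what the allocator is designed to prevent."""
    seen = {}   # (index, shard) -> set of nodes holding a copy
    bad = set()
    for index, shard, role, node in assignments:
        key = (index, shard)
        if node in seen.setdefault(key, set()):
            bad.add(key)
        seen[key].add(node)
    return bad
```

A healthy layout such as [("logs", 0, "p", "node-1"), ("logs", 0, "r", "node-2")] yields no violations; put both copies on node-1 and shard ("logs", 0) is flagged.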
Common Health Scenarios
Scenario 1: New Node Joins
Before: 2 nodes, 6 shards each (fully replicated, GREEN)
New node joins
Action: Master rebalances, shards move to new node
During: usually still GREEN, with relocating_shards > 0 while data copies to the new node
After: GREEN (shards rebalanced across all nodes)
Timeline: Minutes to hours depending on shard size
Scenario 2: Node Failure
Before: 3 nodes, GREEN (all shards have replicas)
Node 2 crashes (network partition, power failure)
Immediately: YELLOW (node 2's shards gone, replicas missing)
Action: Master promotes replicas on nodes 1 and 3 to primary
Creates new replicas on nodes 1 and 3
During: YELLOW (replicas initializing)
After: GREEN (all shards have replicas again)
Timeline: Seconds (replica promotion) + minutes (replica creation)
Scenario 3: Disk Full
Before: 3 nodes, GREEN
Node 1 disk reaches 85% capacity
Action: Elasticsearch refuses to assign new shards to node 1
Symptom: Some shards can't be assigned to node 1, cluster goes YELLOW
Fix: Delete old indices, or add disk space
After: Cluster rebalances, goes GREEN
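The 85% figure is Elasticsearch's default low disk watermark (cluster.routing.allocation.disk.watermark.low). A rough sketch of the decision it drives, with the default thresholds hard-coded (the real allocation decider reads live cluster settings, which can also be absolute byte values):

```python
LOW_WATERMARK = 85.0   # default: stop assigning NEW shards to the node
HIGH_WATERMARK = 90.0  # default: start relocating shards off the node

def disk_allocation_decision(disk_used_percent: float) -> str:
    """Sketch of the disk-based allocation decider, defaults only."""
    if disk_used_percent >= HIGH_WATERMARK:
        return "relocate shards away from this node"
    if disk_used_percent >= LOW_WATERMARK:
        return "keep existing shards, but assign no new ones"
    return "node is eligible for new shards"
```

There is also a 95% flood-stage watermark at which Elasticsearch marks affected indices read-only, which is why you want to alert long before disks get there.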
Reading the Health Endpoint: What Each Field Means
The GET /_cluster/health API is your primary diagnostic tool. Here's what each field tells you:
| Field | Meaning |
|---|---|
| status | GREEN (all good), YELLOW (missing replicas), RED (missing primary) |
| number_of_nodes | Total nodes in the cluster |
| number_of_data_nodes | Nodes that store data |
| active_primary_shards | Primary shards assigned and healthy |
| active_shards | Primary + replica shards assigned and healthy |
| relocating_shards | Shards currently moving to another node |
| initializing_shards | Shards being created or recovered |
| unassigned_shards | Shards that haven't been assigned to a node |
| delayed_unassigned_shards | Shards waiting to be assigned (temporary delay) |
| number_of_pending_tasks | Master tasks waiting to be executed |
Example: Degraded Cluster
{
"status": "yellow",
"number_of_nodes": 3,
"active_primary_shards": 12,
"active_shards": 24,
"unassigned_shards": 12,
"relocating_shards": 2,
"initializing_shards": 4
}
Translation: 3 nodes, 12 primary shards assigned, but only 24 total shards assigned. That means 12 replicas are missing (unassigned). Also, 2 shards are moving, 4 are initializing. The cluster is rebalancing from a recent failure or node addition.
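That translation can be read straight off the numbers. A quick sketch of the arithmetic, using the sample response above:

```python
health = {
    "status": "yellow",
    "number_of_nodes": 3,
    "active_primary_shards": 12,
    "active_shards": 24,
    "unassigned_shards": 12,
    "relocating_shards": 2,
    "initializing_shards": 4,
}

# Active replicas = all active shards minus the active primaries.
active_replicas = health["active_shards"] - health["active_primary_shards"]

print(f"{active_replicas} replicas assigned, "
      f"{health['unassigned_shards']} shards still unassigned, "
      f"{health['relocating_shards']} moving, "
      f"{health['initializing_shards']} initializing")
```

Since every unassigned shard here is a replica (the status is yellow, not red, so all primaries are active), the cluster is healthy but mid-recovery.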
Diagnosing RED: Data Is Missing
A RED cluster means at least one primary shard has no home. This is an emergency.
Find the problematic shard:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
This lists every shard; look for rows in the UNASSIGNED state (the health filter only exists on /_cat/indices, not /_cat/shards). To ask Elasticsearch why a specific shard can't be allocated, use the allocation explain API: GET /_cluster/allocation/explain
Common causes:
- Node failure with insufficient replicas: if a node fails and you had zero replicas, its primary shards are lost. Fix: restore from a snapshot.
- Disk full on all nodes: Elasticsearch won't assign shards to nodes above the disk watermark (85% by default). Fix: delete old indices, add disk space, or adjust the watermark settings.
- Allocation disabled: someone (usually during maintenance or disaster recovery) disabled shard allocation and never turned it back on. Fix: re-enable with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}
- Too many relocating shards: the master is overloaded trying to rebalance. Fix: wait, or reduce concurrent recoveries via the cluster recovery settings.
Diagnosing YELLOW: Replicas Are Missing
YELLOW is a warning, not a failure. You can still read and write. But you're one node failure away from RED.
Check which indices are yellow:
GET /_cat/indices?health=yellow&v
Check if it's stuck or still rebalancing:
GET /_cluster/health?wait_for_status=green&timeout=5m
This waits up to 5 minutes for the cluster to reach GREEN. If it times out, you're stuck yellow.
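The same wait-and-check loop is easy to script for runbooks. A sketch with a pluggable fetch function so it can be pointed at GET /_cluster/health with any HTTP client (the function names are mine):

```python
import time

def wait_for_status(fetch_health, desired="green", timeout=300, interval=5):
    """Poll fetch_health() (a callable returning the parsed
    _cluster/health JSON) until the cluster reaches the desired
    status or the timeout (seconds) expires. Returns True/False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_health().get("status") == desired:
            return True
        time.sleep(interval)
    return False
```

In practice the server-side wait_for_status parameter shown above is simpler; a client-side loop like this is useful when you also want to log intermediate states or check several clusters at once.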
Why you might be stuck yellow:
- Insufficient nodes: you have 1 data node but 2 replicas per shard, so there is nowhere to put the replicas. Fix: add more nodes, or lower number_of_replicas.
- Allocation disabled: replicas won't be assigned while allocation is off. Fix: PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": "all"}}
- Allocation filters blocking replicas: you set a filter that prevents replicas from landing on certain nodes. Fix: review your allocation filtering rules.
Monitoring: Don't Just React, Anticipate
Cluster health is reactive. It tells you what happened, not what will happen. For reliability, monitor proactively:
Alert on these:
- Status == RED (obvious, immediate incident)
- Status == YELLOW for >5 minutes (stuck rebalancing, investigate)
- Unassigned shards > 0 for >10 minutes
- Disk usage >85% on any data node
- Heap usage >80% on any node
- Relocating shards > 5 (recovery is slow)
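The health-based rules above translate directly into code. A sketch of an evaluator over a _cluster/health snapshot (the stuck_minutes argument, meaning how long the current state has persisted, would come from your monitoring system; the heap and disk rules need the nodes APIs instead, so they're omitted here):

```python
def health_alerts(health: dict, stuck_minutes: float) -> list:
    """Evaluate the health-based alert rules against one
    _cluster/health snapshot. Thresholds mirror the list above;
    tune them for your own cluster."""
    fired = []
    if health.get("status") == "red":
        fired.append("RED: missing primary shards, page immediately")
    if health.get("status") == "yellow" and stuck_minutes > 5:
        fired.append("stuck YELLOW for more than 5 minutes")
    if health.get("unassigned_shards", 0) > 0 and stuck_minutes > 10:
        fired.append("unassigned shards for more than 10 minutes")
    if health.get("relocating_shards", 0) > 5:
        fired.append("more than 5 relocating shards, recovery is slow")
    return fired
```

A green cluster with nothing unassigned produces an empty list; anything returned is worth a ticket at minimum.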
Useful dashboard queries:
GET /_cat/nodes?v&h=name,ip,heap.percent,disk.used_percent,cpu,load_1m
This shows node-by-node health: heap usage, disk usage, CPU, load. Red flags: heap >80%, disk >85%.
GET /_nodes/stats/jvm,fs,indices
Deep dive: garbage collection pauses, segment count, cache hit rates. Useful for performance issues hiding behind a GREEN cluster.
Common Mistakes That Destroy Reliability
Mistake 1: Single master-eligible node
- You think it works fine until that one node fails
- No master can be elected; the cluster stops accepting writes and cluster-state changes
- Fix: Always run 3+ master-eligible nodes
Mistake 2: Ignoring YELLOW for days
- "It's yellow, but traffic is fine!" you say
- Then a second node fails, cluster goes RED
- Fix: Investigate YELLOW immediately, restore replicas
Mistake 3: All shards on one node
- You didn't specify replicas or shard allocation rules
- One node failure = RED cluster
- Fix: Use allocation awareness, rack awareness, or zone awareness
Mistake 4: Disabling shard allocation and forgetting to re-enable
- You disabled it during maintenance and moved on
- Weeks later, replicas are still unassigned
- Fix: Audit allocation settings regularly
Mistake 5: Not understanding recovery time
- You expect replicas to be created instantly
- But recovery depends on network bandwidth, index size, merge rate
- You panic and manually delete/recreate indices, making it worse
- Fix: Understand recovery SLOs for your cluster size
Putting It Together: Your First Cluster Health Audit
Here's what to do right now:
# Check overall health
curl "http://localhost:9200/_cluster/health?pretty"
# If not GREEN, check which indices are affected
curl "http://localhost:9200/_cat/indices?health=yellow&v"
# Check which shards are unassigned (look for state UNASSIGNED)
curl "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
# Check node status
curl "http://localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,disk.used_percent,cpu"
# Check allocation settings
curl "http://localhost:9200/_cluster/settings?pretty"
Quote the URLs: an unquoted & would background the curl command and silently drop the rest of the query string.
If you see YELLOW and unassigned shards, it's usually one of these:
- A node is recovering (wait 5-10 min, check again)
- You don't have enough nodes for your replica count (add nodes)
- Disk is full (delete old data)
- Allocation is disabled (re-enable it)
Conclusion: Health Is Visibility
Cluster health is not a number to ignore. It's your window into the distributed system running underneath your search and analytics.
GREEN means you're safe. YELLOW means you're vulnerable. RED means you have a real problem.
The key insight: Elasticsearch recovers automatically most of the time. Your job is to understand what's happening, monitor proactively, and know when to intervene.
Next step: Learn about shard allocation strategies and how to scale your cluster without triggering cascading failures.
About the Author
I'm Prithvi S, Staff Software Engineer at Cloudera and Open Source Enthusiast. Follow my work on GitHub: https://github.com/iprithv