Elasticsearch cluster sanity check and first-aid kit

Here are some commands that have helped me fix yellow or red clusters in the past, especially clusters with unassigned shards. If you have suggestions for improvements, please let me know in the comments. I wish you a great day!

# see cluster health
GET _cluster/health?pretty
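
# a small addition to narrow things down: list only the indices in a given health state (the health filter accepts green, yellow or red), or ask the health API for a per-index breakdown
GET _cat/indices?v&health=red
GET _cluster/health?level=indices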

# see nodes status
GET _cat/nodes?pretty&v=true

# see a summary of the JVM statistics (memory usage, whether GC is triggering a lot, etc.)
GET /_nodes/stats/jvm?pretty

# see shards status
GET /_cat/shards?v

# see shard allocation per node (useful to detect a node whose disk is full)
GET /_cat/allocation?v

# get the detailed reason for the first unassigned shard (the request returns an error if there is no unassigned shard)
GET /_cluster/allocation/explain
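
# to explain one specific shard instead of the first unassigned one, the same API accepts a request body; a sketch where your-index and shard 0 are placeholders (set "primary" to false for a replica)
GET /_cluster/allocation/explain
{
  "index": "your-index",
  "shard": 0,
  "primary": true
}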

# get the reason for any unhealthy shard
GET _cat/shards?h=index,shard,prirep,state,unassigned.reason

# the meaning of each unassigned.reason value is documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html#_example_with_reasons_for_unassigned_shards:~:text=unassigned.-,reason,-%2C%20ur

# if the unassigned shard belongs to an index you can get rid of (logs of a past day for example), the easiest fix is to delete that index
GET _cat/indices?v
DELETE /your-index

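# if the unassigned shard belongs to an index you can NOT get rid of (production data), then try to allocate it manually to a node (if it fails, the precise reason will be described): example for primary shard #2 of your-index (remove the ?dry_run parameter to actually reroute the shard)
# note: the "allocate" command with "allow_primary" only exists before ES 5.0. From ES 5.0 on, use "allocate_replica" (without "accept_data_loss") for a replica shard, and "allocate_stale_primary" or "allocate_empty_primary" with "accept_data_loss": true for a primary. Both primary commands can lose data; allocate_stale_primary needs a node that still holds a copy of the shard, and allocate_empty_primary starts from an empty shard
POST _cluster/reroute?dry_run
{
    "commands" : [
        {
          "allocate_stale_primary" : {
              "index" : "your-index", "shard" : 2, "node" : "your-node-name", "accept_data_loss": true
          }
        }
    ]
}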
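
# a gentler first attempt worth trying before a manual reroute (my addition, not tied to a specific index): when the allocation explain output says the shard failed allocation too many times (index.allocation.max_retries, 5 by default), ask the cluster to retry once the underlying cause is fixed
POST _cluster/reroute?retry_failed=true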

# if your stuck shard is not in UNASSIGNED status but rather in INITIALIZING status
## if you are on ES 5.0 or later, you can try to force the reassignment of a primary shard with the allocate_stale_primary reroute command, which requires "accept_data_loss": true (I never tested it myself actually, only read about this)
## if not and you are comfortable with it, you can try to restart the node this shard is currently assigned to: after the restart, the shard should be back in UNASSIGNED status and you will be able to use the reroute command above (I never tested it myself actually, only read about this)

# check your cluster settings (allocation rules for example)
GET _cluster/settings
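
# by default only explicitly set values are returned; to also see the defaults (for example the cluster.routing.allocation.disk.watermark.* thresholds that block allocation on a nearly full disk), something like this should work
GET _cluster/settings?include_defaults=true&flat_settings=true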

# exclude a bad node from shard allocation by its IP
PUT _cluster/settings
{
  "transient" :{
      "cluster.routing.allocation.exclude._ip" : "your-node-ip"
  }
}
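
# once the node is healthy again, the exclusion can be removed by setting the same key to null (shown here for the transient setting used above)
PUT _cluster/settings
{
  "transient" :{
      "cluster.routing.allocation.exclude._ip" : null
  }
}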

# check your index settings (number of shards and replicas for example)
GET /your-index/_settings

# if a replica shard is unassigned, a known workaround is to set the number of replicas to 0 (this deletes the replica shards), then set it back to its original value (this recreates them). But I recommend AVOIDING this: it puts a big load on the cluster, and it is a risky procedure, especially if the cluster is red
PUT /your-index/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}
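
# second half of the workaround: restore the replica count afterwards (1 is only a placeholder here, use the value you noted from GET /your-index/_settings)
PUT /your-index/_settings
{
    "index" : {
        "number_of_replicas" : 1
    }
}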
