
Lucien Boix


ElasticSearch cluster first-aid kit

Here are some commands I have used in the past to fix a yellow or red cluster, especially one with unassigned shards. If you have suggestions for improvements, please let me know in the comments. I wish you a great day!

# see cluster health
GET _cluster/health?pretty
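
# a minimal Python sketch (my own illustration, not an official client) of how the health response could be turned into a one-line triage summary; it assumes the response contains the usual "status", "unassigned_shards" and "initializing_shards" fields, and the sample below is a hypothetical, trimmed response

```python
import json


def summarize_health(health_json: str) -> str:
    """Turn a _cluster/health JSON response into a one-line triage summary."""
    health = json.loads(health_json)
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    initializing = health.get("initializing_shards", 0)

    if status == "green":
        return "green: all primary and replica shards are assigned"
    if status == "yellow":
        # Yellow means every primary is active but some replicas are not.
        return (f"yellow: primaries are fine, {unassigned} replica shard(s) "
                f"unassigned, {initializing} initializing")
    # Red means at least one primary shard is not active.
    return (f"red: at least one primary is missing, {unassigned} shard(s) "
            f"unassigned, {initializing} initializing")


# Hypothetical, trimmed response used only to illustrate the function.
sample_health = '{"status": "yellow", "unassigned_shards": 3, "initializing_shards": 0}'
print(summarize_health(sample_health))
```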

# see nodes status
GET _cat/nodes?pretty&v=true

# see a summary of the JVM statistics (memory usage, whether GC is triggering a lot, etc.)
GET /_nodes/stats/jvm?pretty

# see shards status
GET /_cat/shards?v

# see shards allocation (useful to detect whether a node has a full disk)
GET /_cat/allocation?v
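
# a rough Python sketch for spotting nearly full nodes in the text output above; it assumes the header row (from ?v) contains the disk.percent and node columns, the 85 default mirrors the default low disk watermark, and the sample output is hypothetical

```python
def nodes_above_disk_threshold(cat_allocation: str, threshold: int = 85):
    """Return (node, disk_percent) pairs at or above the threshold."""
    lines = [ln for ln in cat_allocation.splitlines() if ln.strip()]
    header = lines[0].split()
    # Look columns up by name, since the column set varies between versions.
    pct_col = header.index("disk.percent")
    node_col = header.index("node")

    flagged = []
    for line in lines[1:]:
        cols = line.split()
        # Skip rows with blank cells, such as the UNASSIGNED summary row.
        if len(cols) != len(header) or not cols[pct_col].isdigit():
            continue
        percent = int(cols[pct_col])
        if percent >= threshold:
            flagged.append((cols[node_col], percent))
    return flagged


# Hypothetical output of GET /_cat/allocation?v, for illustration only.
sample_allocation = """
shards disk.indices disk.used disk.avail disk.total disk.percent host     ip       node
    12       40.1gb    91.2gb      8.8gb      100gb           91 10.0.0.1 10.0.0.1 node-1
    10       20.3gb    45.0gb     55.0gb      100gb           45 10.0.0.2 10.0.0.2 node-2
     3                                                                             UNASSIGNED
"""
print(nodes_above_disk_threshold(sample_allocation))
```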

# get detailed reason for the first unassigned shard
GET /_cluster/allocation/explain

# get the reason for any unhealthy shard
GET _cat/shards?h=index,shard,prirep,state,unassigned.reason

# the meaning of each unassigned.reason value is documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html#_example_with_reasons_for_unassigned_shards:~:text=unassigned.-,reason,-%2C%20ur
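
# a small Python sketch that filters the plain-text output of the _cat/shards request above down to the unassigned shards; it assumes the five columns come in the order requested (index, shard, prirep, state, unassigned.reason) with no header row, and the sample lines are hypothetical

```python
def unassigned_shards(cat_shards: str):
    """Extract unassigned shards from whitespace-separated _cat/shards output."""
    found = []
    for line in cat_shards.splitlines():
        cols = line.split()
        # Assigned shards have an empty reason column, so they yield 4 tokens.
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            found.append({
                "index": cols[0],
                "shard": int(cols[1]),
                "primary": cols[2] == "p",
                "reason": cols[4] if len(cols) > 4 else "unknown",
            })
    return found


# Hypothetical output, for illustration only.
sample_shards = """
logs-2024.01.01 0 p STARTED
logs-2024.01.01 0 r UNASSIGNED NODE_LEFT
your-index      2 p UNASSIGNED ALLOCATION_FAILED
"""
for shard in unassigned_shards(sample_shards):
    print(shard)
```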

# if the unassigned shard belongs to an index you can get rid of (logs of a past day for example), the easiest fix is to remove the related index
GET _cat/indices?v
DELETE /your_index

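# if the unassigned shard belongs to an index you can NOT get rid of (production data), try to reroute it to another node (if it fails, the precise reason is returned). The example below targets primary shard #2 of your-index; use "allow_primary": false for a replica shard, and remove the ?dry_run parameter to actually reroute the shard. Note that the allocate command and allow_primary only exist in Elasticsearch 2.x and earlier: from 5.0 onward use allocate_replica for a replica, or allocate_stale_primary / allocate_empty_primary with "accept_data_loss": true for a primary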
POST _cluster/reroute?dry_run
{
    "commands" : [
        {
          "allocate" : {
              "index" : "your-index", "shard" : 2, "node" : "new-node-name", "allow_primary": true
          }
        }
    ]
}
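
# for Elasticsearch 5.0 and later, where the reroute API expects allocate_replica or allocate_stale_primary instead of allocate, a request body could be built like this; a sketch only, the helper name reroute_body is hypothetical, and allocate_stale_primary can lose data and only works if the target node still holds a copy of the shard

```python
import json


def reroute_body(index: str, shard: int, node: str, primary: bool) -> str:
    """Build a JSON body for POST _cluster/reroute (Elasticsearch 5.0+ syntax)."""
    if primary:
        # Promotes a stale copy on `node` to primary; must be acknowledged
        # explicitly because recent writes may be lost.
        command = {
            "allocate_stale_primary": {
                "index": index,
                "shard": shard,
                "node": node,
                "accept_data_loss": True,
            }
        }
    else:
        # Replicas are rebuilt from the primary, so no data-loss flag exists.
        command = {
            "allocate_replica": {"index": index, "shard": shard, "node": node}
        }
    return json.dumps({"commands": [command]}, indent=2)


# Same placeholder values as the example request above.
print(reroute_body("your-index", 2, "new-node-name", primary=False))
```

# send the printed body with POST _cluster/reroute?dry_run first, then drop ?dry_run once the response looks right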

# if your stuck shard is not in UNASSIGNED status but rather in INITIALIZING status
## if you are on ES7+, you can reportedly force the reassignment of the shard with the command above, replacing allocate with allocate_stale_primary and adding "accept_data_loss": true (I have never tested this myself, only read about it)
## if not, and you are comfortable doing so, you can try to restart the node this shard is currently assigned to: after the restart, the shard should be back in UNASSIGNED status and you will be able to use the command above (I have never tested this myself, only read about it)

# check your cluster settings (allocation rules for example)
GET _cluster/settings

# exclude a bad node from shard allocation by its IP
PUT _cluster/settings
{
  "transient" :{
      "cluster.routing.allocation.exclude._ip" : "your-node-ip"
  }
}
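
# once the bad node is fixed, the exclusion has to be removed again, which is done by setting the same key to null; a short Python sketch that builds both request bodies (the helper name exclude_ip_body is hypothetical)

```python
import json
from typing import Optional

EXCLUDE_IP_SETTING = "cluster.routing.allocation.exclude._ip"


def exclude_ip_body(ip: Optional[str]) -> str:
    """Build a body for PUT _cluster/settings; pass None to clear the exclusion."""
    # json.dumps turns None into null, which resets the setting.
    return json.dumps({"transient": {EXCLUDE_IP_SETTING: ip}}, indent=2)


print(exclude_ip_body("your-node-ip"))  # exclude the node
print(exclude_ip_body(None))            # remove the exclusion afterwards
```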

# check your index settings (shards and replicas number for example)
GET /your-index/_settings

# if you have an unassigned replica shard, a known workaround is to set the number of replicas to 0 (this deletes the replica shards) and then set it back to its original value (this recreates them). I recommend AVOIDING this, as it puts a big load on the cluster and is risky, especially if the cluster state is red
PUT /your-index/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}