DEV Community

Lucien Boix

Posted on • Edited on


Elasticsearch cluster first-aid kit

Here are some commands I have used in the past to fix a yellow or red cluster, especially one with unassigned shards. If you have suggestions for improvements, please let me know in the comments. Have a great day!

# see cluster health
GET _cluster/health?pretty

# see nodes status
GET _cat/nodes?pretty&v=true

# see a summary of the JVM statistics (memory usage, whether GC is triggering a lot, etc.)
GET /_nodes/stats/jvm?pretty
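
The JVM stats response is long, so a small script can help spot the nodes under memory pressure. Here is a minimal Python sketch, assuming you have already loaded the JSON returned by the request above into a dict. The function name `heap_pressure` and the 85% threshold are my own illustrative choices, not Elasticsearch defaults.

```python
def heap_pressure(stats, threshold=85):
    """Return (node_name, heap_used_percent) for nodes at or above threshold,
    highest heap usage first."""
    hot = []
    for node in stats.get("nodes", {}).values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold:
            hot.append((node["name"], used))
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Hand-written sample shaped like the real response (values are made up).
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {"mem": {"heap_used_percent": 91}}},
        "b2": {"name": "node-2", "jvm": {"mem": {"heap_used_percent": 42}}},
    }
}
print(heap_pressure(sample))
```

A node that stays in this list for a long time is a good candidate for a closer look at its GC activity.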

# see shards status
GET /_cat/shards?v

# see shard allocation per node (useful to detect a node whose disk is full)
GET /_cat/allocation?v
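
If you prefer to check disk usage from a script, the same endpoint can return JSON when called with `?format=json`. The sketch below assumes you have loaded that JSON into a list of dicts. The function name `nodes_over_disk` is hypothetical, and the default threshold of 85 mirrors the default low disk watermark, so adjust it if your cluster overrides that setting.

```python
def nodes_over_disk(rows, threshold=85):
    """Return (node, disk_percent) for nodes whose disk usage is at or above threshold."""
    flagged = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            # The UNASSIGNED pseudo-row carries no disk figures, skip it.
            continue
        if int(percent) >= threshold:
            flagged.append((row["node"], int(percent)))
    return flagged


# Hand-written sample shaped like GET /_cat/allocation?format=json (values are made up;
# the cat APIs return numbers as strings).
sample_rows = [
    {"shards": "12", "disk.percent": "93", "node": "node-1"},
    {"shards": "11", "disk.percent": "40", "node": "node-2"},
    {"shards": "3", "disk.percent": None, "node": "UNASSIGNED"},
]
print(nodes_over_disk(sample_rows))
```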

# get the detailed reason for the first unassigned shard
GET /_cluster/allocation/explain

# get the reason for any unhealthy shard
GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
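
On a cluster with many shards, this output is easier to read once it is filtered down to the unassigned ones. Here is a small Python sketch that parses the plain-text output of the request above (without `?v`, so there is no header row). The function name `unassigned_shards` and the sample index names are only illustrative.

```python
def unassigned_shards(cat_output):
    """Parse `_cat/shards?h=index,shard,prirep,state,unassigned.reason` text output
    and return one dict per UNASSIGNED shard."""
    found = []
    for line in cat_output.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[3] == "UNASSIGNED":
            found.append({
                "index": parts[0],
                "shard": int(parts[1]),
                "primary": parts[2] == "p",
                # The reason column is empty for healthy shards, so guard the lookup.
                "reason": parts[4] if len(parts) > 4 else "",
            })
    return found


# Hand-written sample output (made up for illustration).
sample_output = """\
logs-2024.01.01 0 p STARTED
logs-2024.01.01 0 r UNASSIGNED NODE_LEFT
your-index 2 p UNASSIGNED ALLOCATION_FAILED
"""
for shard in unassigned_shards(sample_output):
    print(shard)
```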

# the meaning of each unassigned.reason value is documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html#_example_with_reasons_for_unassigned_shards:~:text=unassigned.-,reason,-%2C%20ur

# if the unassigned shard belongs to an index you can get rid of (logs from a past day, for example), the easiest fix is to delete that index
GET _cat/indices?v
DELETE /your_index

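# if the unassigned shard belongs to an index you can NOT get rid of (production data), try to reroute it to another node (if it fails, the precise reason is given in the response)
# example for primary shard #2 of your-index: use "allow_primary": false for a replica shard, and remove the ?dry_run parameter to actually reroute the shard
# note: this allocate command only exists in Elasticsearch versions before 5.0; from 5.0 on, use allocate_replica for a replica, or allocate_stale_primary / allocate_empty_primary (both require "accept_data_loss": true) for a primary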
POST _cluster/reroute?dry_run
{
    "commands" : [
        {
          "allocate" : {
              "index" : "your-index", "shard" : 2, "node" : "new-node-name", "allow_primary": true
          }
        }
    ]
}
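
On Elasticsearch 5.0 and later, the plain `allocate` command is replaced by `allocate_replica`, `allocate_stale_primary` and `allocate_empty_primary`. Here is a hedged Python sketch that builds the request body for the first two. The helper name `reroute_body` is hypothetical, and it deliberately refuses to build a primary allocation unless you acknowledge the possible data loss, because Elasticsearch itself requires `"accept_data_loss": true` for that command.

```python
import json


def reroute_body(index, shard, node, primary=False, accept_data_loss=False):
    """Build a _cluster/reroute body for Elasticsearch 5.0+.

    Replicas use allocate_replica; primaries use allocate_stale_primary,
    which can lose data and therefore needs an explicit opt-in.
    """
    target = {"index": index, "shard": shard, "node": node}
    if primary:
        if not accept_data_loss:
            raise ValueError("allocate_stale_primary requires accept_data_loss=True")
        target["accept_data_loss"] = True
        command = {"allocate_stale_primary": target}
    else:
        command = {"allocate_replica": target}
    return {"commands": [command]}


# Body to send with POST _cluster/reroute?dry_run (placeholder names from the example above).
print(json.dumps(reroute_body("your-index", 2, "new-node-name"), indent=2))
```

As with the console example, I would send the result with `?dry_run` first and only drop that parameter once the response looks right.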

# if your stuck shard is not in UNASSIGNED status but rather in INITIALIZING status
## on ES7+ you can force the reassignment of the shard with the command above, replacing allocate with allocate_stale_primary, dropping "allow_primary" and adding "accept_data_loss": true (I have never tested this myself, I only read about it)
## otherwise, if you are comfortable doing so, you can try restarting the node this shard is currently assigned to: after the restart, the shard should be back in UNASSIGNED status and you will be able to use the command above (I have never tested this myself, I only read about it)

# check your cluster settings (allocation rules for example)
GET _cluster/settings

# exclude a bad node from shard allocation by its IP
PUT _cluster/settings
{
  "transient" :{
      "cluster.routing.allocation.exclude._ip" : "your-node-ip"
  }
}
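
Once the bad node is fixed, the exclusion has to be removed again, which is done by setting the same key to null. Here is a tiny Python sketch that builds both bodies. The helper name `exclude_ip_body` is hypothetical, and `your-node-ip` is the same placeholder as above.

```python
import json


def exclude_ip_body(ip=None):
    """Build a PUT _cluster/settings body that excludes `ip` from shard allocation.

    Passing None produces a JSON null, which resets the setting.
    """
    return {"transient": {"cluster.routing.allocation.exclude._ip": ip}}


print(json.dumps(exclude_ip_body("your-node-ip")))  # exclude the node
print(json.dumps(exclude_ip_body()))                # clear the exclusion afterwards
```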

# check your index settings (number of shards and replicas, for example)
GET /your-index/_settings

# if you have an unassigned replica shard, a known workaround is to set the number of replicas to 0 (this deletes the replica shards) and then set it back to its original value (this recreates them). I recommend AVOIDING this, though: it puts a heavy load on the cluster and is risky, especially if the cluster is red
PUT /your-index/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}