alok shankar

🚨 Elasticsearch High CPU Issue Due to Memory Pressure – Real Production Incident & Fix

πŸ” Introduction

Running Elasticsearch in production requires deep visibility into CPU, memory, shards, and cluster health.

One of the most confusing scenarios DevOps engineers face is:

⚠️ High CPU alerts, but CPU usage looks normal

In this blog, I’ll walk you through a real production incident where:

Elasticsearch triggered CPU alerts
But the actual root cause was memory pressure + shard imbalance + node failure

We’ll cover:

  1. Core Elasticsearch concepts
  2. Real logs and debugging steps
  3. Root cause analysis
  4. Production fix

πŸ“˜ Important Elasticsearch Concepts

Before diving into the issue, let’s understand some key building blocks.

πŸ“¦ How Elasticsearch Stores Data

Elasticsearch stores data as documents, grouped into an index.

However, when data grows large (billions/trillions of records), a single index cannot be stored efficiently on one node.

πŸ”Ή What is an Index?

An Index is:

  1. A collection of documents
  2. Logical partition of data
  3. Similar to a database

πŸ‘‰ Example:

  1. metricbeat-*
  2. .monitoring-*
  3. user-data

πŸ”Ή What are Shards?

To scale horizontally, Elasticsearch splits an index into shards.

  1. Each shard is a small unit of data
  2. Stored across multiple nodes
  3. Acts like a mini-index

βš™οΈ Why Shards Matter
βœ… Scalability β†’ Data distributed across nodes
βœ… Performance β†’ Parallel query execution
βœ… Availability β†’ Supports failover

πŸ” Primary vs Replica Shards

  1. Primary Shard β†’ Original data
  2. Replica Shard β†’ Copy for fault tolerance
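As a concrete illustration, primary and replica counts are fixed at index creation time. A minimal sketch (the index name `user-data` is reused from the example above; the counts are illustrative and need a running cluster on the default port):

```shell
# Create an index with 3 primary shards and 1 replica per primary
# (6 shard copies total, spread across the data nodes)
curl -X PUT "localhost:9200/user-data?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }'
```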

🚨 Cluster Health Status
🟒 Green β†’ All shards assigned
🟑 Yellow β†’ Replica shards missing
πŸ”΄ Red β†’ Primary shards missing

🧠 JVM & Memory Basics

Elasticsearch runs on JVM:

  1. Heap memory is critical
  2. High usage β†’ Garbage Collection (GC)
  3. GC β†’ CPU spikes

⚠️ Production Issue Overview

We received alerts for:

πŸ”΄ High CPU usage
⚠️ Cluster health degraded
πŸ“‰ Slow search performance

πŸ“Š Investigation & Debugging

πŸ” Step 1: Cluster Health Check

[ec2-user@ip-x-x-x-x ~]$ curl -X GET "localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "web-test",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 247,
  "active_shards" : 343,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 193,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 63.99253731343284
}
[ec2-user@ip-x-x-x-x ~]$ curl -X GET "localhost:9200/_cluster/health?filter_path=status,*_shards&pretty"
{
  "status" : "yellow",
  "active_primary_shards" : 247,
  "active_shards" : 343,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 193,
  "delayed_unassigned_shards" : 0
}

πŸ‘‰ Key Insight:

193 unassigned shards β†’ Major issue

πŸ” Step 2: Node Resource Usage

[ec2-user@ip-x-x-x-x ~]$ curl -X GET "localhost:9200/_cat/nodes?v=true&s=cpu:desc&pretty"
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
1x.x.x.2x9           73          97   3    0.19    0.16     0.11 cdfhilmrstw -      node-5
1x.x.x.8x            77          90   2    0.03    0.06     0.03 cdfhilmrstw *      node-1
1x.x.x.x            60          84   1    0.22    0.65     0.72 cdfhilmrstw -      node-3
1x.x.x.x            46          90   1    0.03    0.06     0.01 cdfhilmrstw -      node-4
1x.x.x.x            65          91   0    0.01    0.03     0.00 cdfhilmrstw -      node-2

Observation:

  1. CPU: 0–3% (low)
  2. RAM: 84–97% (very high)

πŸ‘‰ This is critical:

CPU alert was misleading β€” actual issue was memory pressure

πŸ” Step 3: OS-Level Analysis

[ec2-user@ip-x-x-x-xx ~]$ top
top - 10:57:46 up 13 days, 22:42,  1 user,  load average: 0.77, 0.73, 0.60
Tasks: 114 total,   1 running,  64 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us,  0.1 sy,  0.0 ni, 97.6 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7863696 total,   744000 free,  5938932 used,  1180764 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2202220 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3743 elastic+  20   0   48.0g   4.9g  36368 S   8.7 65.7   7078:50 java
    1 root      20   0  117520   5144   3408 S   0.0  0.1  22:27.92 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.25 kthreadd
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    7 root      20   0       0      0      0 S   0.0  0.0   0:13.95 ksoftirqd/0
    8 root      20   0       0      0      0 I   0.0  0.0   2:29.56 rcu_sched
    9 root      20   0       0      0      0 I   0.0  0.0   0:00.00 rcu_bh
   10 root      rt   0       0      0      0 S   0.0  0.0   0:02.68 migration/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:01.54 watchdog/0
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
   14 root      rt   0       0      0      0 S   0.0  0.0   0:01.63 watchdog/1

Findings:
Java process:

  1. ~4.9 GB memory usage
  2. ~65% system memory

πŸ‘‰ Elasticsearch consuming most resources

πŸ” Step 4: JVM Memory Pressure

curl -X GET "localhost:9200/_nodes/stats?filter_path=nodes.*.jvm.mem.pools.old&pretty"

Observation:

  1. High old-gen memory usage
  2. Frequent GC cycles
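To confirm GC pressure rather than infer it, the node stats API also exposes per-collector counts and pause times (a sketch; the exact response shape can vary slightly by Elasticsearch version):

```shell
# Old-generation GC counts and cumulative pause time per node;
# rapidly growing numbers between samples indicate heap pressure
curl -X GET "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.gc.collectors&pretty"
```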

πŸ” Step 5: Unassigned Shards Analysis

Unassigned shards have a state of UNASSIGNED. The prirep value is p for primary shards and r for replicas.

[ec2-user@ip-x-x-x-xx ~]$ curl -X GET "localhost:9200/_cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state&pretty"
index                                                       shard  prirep     state          unassigned.reason
product_search_tab_data                                      0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2023.02.08-000024                           0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.0-2022.12.04-000004                           0     r      UNASSIGNED        NODE_LEFT
.monitoring-es-7-mb-2023.04.16                                0     r      UNASSIGNED        REPLICA_ADDED
.monitoring-es-7-mb-2023.04.14                                0     r      UNASSIGNED        REPLICA_ADDED
apm-7.9.2-span-000002                                         0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2021.12.29-000012                           0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_search_analytics                                     0     r      UNASSIGNED        NODE_LEFT
product_fap_model_item                                       0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.10.2-2021.11.29-000011                           0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.1-2022.12.07-000008                           0     r      UNASSIGNED        NODE_LEFT
.kibana-event-log-7.9.2-000024                                0     r      UNASSIGNED        NODE_LEFT
.kibana-event-log-7.17.1-000010                               0     r      UNASSIGNED        NODE_LEFT
.monitoring-kibana-7-2023.04.16                               0     r      UNASSIGNED        REPLICA_ADDED
.kibana-event-log-7.9.2-000026                                0     r      UNASSIGNED        INDEX_CREATED
product_fap_price                                            0     r      UNASSIGNED        NODE_LEFT
.ds-.logs-deprecation.elasticsearch-default-2022.12.12-000020 0     r      UNASSIGNED        NODE_LEFT
ilm-history-2-000025                                          0     r      UNASSIGNED        NODE_LEFT
metricbeat-7.17.1-2022.10.08-000006                           0     r      UNASSIGNED        NODE_LEFT
ilm-history-2-000023                                          0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT
product_product_hierarchy                                    0     r      UNASSIGNED        NODE_LEFT

Key Finding:

  1. UNASSIGNED β†’ NODE_LEFT

πŸ‘‰ Meaning:

  1. A node left the cluster
  2. Replica shards not reassigned

πŸ” Step 6: UNASSIGNED Shard Analysis

To understand why an unassigned shard is not being assigned and what action you must take to allow Elasticsearch to assign it, use the cluster allocation explanation API.

[ec2-user@ip-x-x-x-xx ~]$ curl -X GET "localhost:9200/_cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*&pretty"
{
  "index" : "product_search_tab_data",
  "node_allocation_decisions" : [
    {
      "node_name" : "node-1",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[product_search_tab_data][0], node[EQ6QyUbhQZCZRqP78rMIIQ], [P], s[STARTED], a[id=7vBWLesZQAS4zYjt_ER2bw]]"
        },
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.42130719712077%]"
        }
      ]
    },
    {
      "node_name" : "node-5",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [9.907598002066106%]"
        }
      ]
    },
    {
      "node_name" : "node-2",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [11.010075893021023%]"
        }
      ]
    },
    {
      "node_name" : "node-3",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.938318653211446%]"
        }
      ]
    },
    {
      "node_name" : "node-4",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [12.273611767876893%]"
        }
      ]
    }
  ]
}

🧠 Root Cause Analysis (RCA)

After correlating all logs, metrics, and cluster behavior, we identified multiple layered issues contributing to the problem.

πŸ”΄ 1. Large Number of Unassigned Shards
193 shards were unassigned

Majority had reason:

UNASSIGNED β†’ NODE_LEFT

πŸ‘‰ Impact:

  1. Continuous shard allocation attempts
  2. Increased cluster overhead
  3. Memory and thread pressure

πŸ”΄ 2. Node Failure (NODE_LEFT)

  1. One or more nodes temporarily left the cluster
  2. Replica shards lost their assigned nodes

πŸ‘‰ Result:

  1. Cluster moved to YELLOW state
  2. Triggered rebalancing operations

πŸ”΄ 3. Disk Watermark Threshold Breach (Critical Finding 🚨)

During shard allocation analysis, we found:
"index": "search",
"node_allocation_decisions": [
  {
    "node_name": "node-3",
    "deciders": [
      {
        "decider": "disk_threshold",
        "decision": "NO",
        "explanation": "node above low watermark (85%), free: ~7.6%"
      }
    ]
  },
  {
    "node_name": "node-5",
    "deciders": [
      {
        "decider": "disk_threshold",
        "decision": "NO",
        "explanation": "node above low watermark (85%), free: ~9.6%"
      }
    ]
  },
  {
    "node_name": "node-4",
    "deciders": [
      {
        "decider": "disk_threshold",
        "decision": "NO",
        "explanation": "node above low watermark (85%), free: ~10.7%"
      }
    ]
  }
]

πŸ‘‰ Key Insight:

Elasticsearch refused to allocate shards on these nodes because disk usage had crossed the low watermark:

cluster.routing.allocation.disk.watermark.low = 85%

πŸ‘‰ Actual situation:

Nodes had only ~7%–10% free disk space
Allocation decision = ❌ NO

⚠️ Why This Is Critical

When disk watermark is breached:

  1. Elasticsearch blocks shard allocation
  2. Unassigned shards remain stuck
  3. Cluster cannot rebalance

πŸ‘‰ This directly caused:

  1. Persistent unassigned shards
  2. Memory pressure
  3. Internal retries β†’ CPU spikes
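As a stop-gap while disk capacity is being added, the watermarks can be raised temporarily via the cluster settings API (a sketch; the 90%/95% values are illustrative, transient settings revert on a full cluster restart, and this only buys time — it does not fix the underlying disk shortage):

```shell
# Temporarily raise the disk watermarks so allocation can resume
curl -X PUT "localhost:9200/_cluster/settings?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "transient": {
      "cluster.routing.allocation.disk.watermark.low": "90%",
      "cluster.routing.allocation.disk.watermark.high": "95%"
    }
  }'
```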

πŸ”΄ 4. High JVM Memory Pressure

  1. Heap usage consistently high
  2. JVM old-gen heavily utilized

πŸ‘‰ Result:

  1. Frequent Garbage Collection (GC)
  2. CPU spikes during GC cycles

πŸ”΄ 5. Thread Pool Pressure

Even though CPU looked low, threads were blocked due to:

  1. Allocation retries
  2. Memory pressure

πŸ‘‰ As per Elasticsearch behavior, thread pool exhaustion can trigger CPU-related alerts.

🧩 Final Root Cause Summary

The issue was NOT just CPU-related.

It was a combination of:

  • ❌ Disk space exhaustion (Watermark breach)
  • ❌ Unassigned shards (allocation blocked)
  • ❌ Node failure (NODE_LEFT)
  • ❌ High JVM memory pressure
  • ❌ Continuous allocation retries

πŸ› οΈ Final Fix Implemented

After complete analysis, we identified that:

πŸ‘‰ Insufficient disk space was the primary blocker

πŸ”§ Solution Steps
βœ… 1. Increased Disk Capacity

  1. Added +50 GB storage to all Elasticsearch nodes

πŸ‘‰ Result:

  1. Disk usage dropped below the watermark threshold
  2. Shard allocation resumed

monitoring-kibana-7-2023.04.17                               0     p      STARTED    node-5
catelog-7.9.2-span-000010                                    0     p      STARTED    node-1
catelog-7.9.2-span-000010                                    0     r      STARTED    node-3
product_fragments                                            0     p      STARTED    node-3
packetbeat-7.9.3-2023.04.14-000019                            0     p      STARTED    node-5
metricbeat-7.10.2-2022.04.14-000014                           0     p      STARTED    node-3
.ds-.logs-deprecation.elasticsearch-default-2022.09.19-000014 0     p      STARTED    node-1
.ds-ilm-history-5-2023.04.09-000028                           0     p      STARTED    node-5
catelog-7.9.2-profile-000010                                  0     p      STARTED    node-2
catelog-7.9.2-profile-000010                                  0     r      STARTED    node-3
packetbeat-7.9.3-2022.09.16-000012                            0     p      STARTED    node-2
metricbeat-7.13.3-2021.07.11-000001                           0     p      STARTED    node-2
logstash                                                      0     p      STARTED    node-3
.monitoring-es-7-mb-2023.04.12                                0     p      STARTED    node-4
.catelog-custom-link                                          0     p      STARTED    node-1
.catelog-custom-link                                          0     r      STARTED    node-3
catelog-7.9.2-metric-000015                                   0     p      STARTED    node-1
catelog-7.9.2-metric-000015                                   0     r      STARTED    node-3
catelog-7.9.2-profile-000017                                  0     r      STARTED    node-3
catelog-7.9.2-profile-000017                                  0     p      STARTED    node-5
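To verify disk headroom after resizing, `_cat/allocation` shows per-node disk usage alongside shard counts (a quick check, assuming the default port):

```shell
# Per-node shard count and disk usage; disk.percent should now
# sit comfortably below the 85% low watermark on every node
curl -X GET "localhost:9200/_cat/allocation?v=true&h=node,shards,disk.used,disk.avail,disk.percent&pretty"
```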

βœ… 2. Rolling Restart

  1. Restarted nodes one by one (rolling restart)

πŸ‘‰ Ensured:

  1. No downtime
  2. Safe cluster recovery
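The standard rolling-restart procedure also disables replica allocation around each node restart, so the cluster does not start copying shards while a node is briefly down (a sketch of the documented flow):

```shell
# Before stopping a node: allow only primary shard allocation
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'

# ... restart the node and wait for it to rejoin the cluster ...

# After the node rejoins: restore the default (full) allocation
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": null}}'
```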

βœ… 3. Automatic Shard Reallocation

  1. Elasticsearch started assigning shards automatically
  2. Cluster began stabilizing
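One caveat worth knowing: if a shard exhausted its allocation retries while disk was full, Elasticsearch will not retry it on its own. The reroute API's `retry_failed` flag kicks those shards off again (only needed when retries were exhausted; in our case most shards recovered automatically):

```shell
# Retry allocation for shards that hit the max retry limit
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"
```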

🎯 Final Result
βœ… Unassigned shards β†’ 0
βœ… Cluster status β†’ GREEN
βœ… Memory pressure reduced
βœ… CPU spikes eliminated

[ec2-user@ip-x-x-x-xx ~]$ curl -X GET "localhost:9200/_cluster/health?pretty"
{
  "cluster_name" : "web-test",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 5,
  "active_primary_shards" : 247,
  "active_shards" : 536,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

πŸ’‘ Key Learning (Very Important πŸš€)

πŸ”₯ Disk space is directly linked to cluster stability in Elasticsearch

Even if:

  1. CPU looks fine
  2. Memory seems manageable

πŸ‘‰ If disk crosses watermark:

  1. Shards won’t allocate
  2. Cluster will degrade

✍️ Conclusion

This incident was a great reminder that Elasticsearch performance issues are rarely straightforward.

What initially appeared as a high CPU problem turned out to be a cascading failure caused by:

  1. Disk watermark threshold breaches
  2. Unassigned shards
  3. Node failure (NODE_LEFT)
  4. JVM memory pressure
  5. Continuous shard allocation retries

πŸ‘‰ The most critical takeaway:

πŸ”₯ Disk space is not just a storage concern in Elasticsearch β€” it directly impacts shard allocation, memory usage, and overall cluster stability.

Even when CPU usage looks normal, underlying factors like:

  1. Heap pressure
  2. Disk utilization
  3. Cluster health

can silently degrade the system until it reaches a breaking point.

πŸš€ Final Thoughts for DevOps Engineers

In production environments, always think beyond surface-level alerts:

  1. Don’t trust CPU metrics alone
  2. Correlate memory, disk, and cluster state
  3. Monitor unassigned shards and disk watermarks proactively
  4. Design clusters with proper shard sizing and capacity planning.
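The monitoring advice above can be scripted as a lightweight check (a sketch; the host, threshold, and alerting mechanism are placeholders for whatever your monitoring stack uses):

```shell
# Warn if the cluster reports any unassigned shards;
# filter_path keeps the health response down to one field
UNASSIGNED=$(curl -s "localhost:9200/_cluster/health?filter_path=unassigned_shards" \
  | grep -o '[0-9]\+')
if [ "${UNASSIGNED:-0}" -gt 0 ]; then
  echo "WARNING: ${UNASSIGNED} unassigned shards"
fi
```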
