DEV Community

Stanislav Berkov


Cassandra high CPU load can be caused by Prometheus misconfiguration

I have a dev Cassandra cluster that other developers and I use for testing our websites. It has around 50-60 keyspaces. We had an issue with this cluster: constant high CPU usage even with no reads or writes. The cluster has 3 nodes, and each node has 4 CPU cores assigned. Each node sat at 400% CPU almost all the time, even with no reads or writes at all. I contacted our internal Cassandra expert about this and got the response that such a load was fine given the number of keyspaces. Then more developers started using the cluster, and Cassandra started struggling badly. I tried to deploy my website (which uses Cassandra as storage) and found that 2 out of 3 nodes were down. The same thing had already happened the day before. I decided to get to the bottom of why Cassandra was so slow, whatever it took.

I came across an article on Cassandra tuning: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

The page references https://github.com/aragozin/jvm-tools, a set of really powerful tools for analyzing Java process statistics.

I downloaded the binary from https://mavenlibs.com/jar/file/org.gridkit.jvmtool/sjk (since the original link was unavailable) and ran its ttop command against Cassandra's JMX port (7199), sorted by CPU and limited to the top 30 threads:

wget https://repo1.maven.org/maven2/org/gridkit/jvmtool/sjk/0.21/sjk-0.21.jar
java -jar sjk-0.21.jar ttop -s localhost:7199 -o CPU -n 30

It gave me a clue about what was loading Cassandra:

2023-07-28T18:43:01.041-0500 Process summary
  process cpu=106.16%
  application cpu=105.76% (user=104.62% sys=1.15%)
  other: cpu=0.39%
  thread count: 95
  heap allocation rate 14mb/s
[001422] user=30.05% sys= 0.09% alloc= 5407kb/s - prometheus-http-1-3
[000542] user=20.31% sys= 0.02% alloc= 5424kb/s - prometheus-http-1-2
[000078] user=19.28% sys= 0.11% alloc=  659kb/s - prometheus-http-1-1
[002413] user=18.87% sys= 0.09% alloc= 1311kb/s - prometheus-http-1-5
[002050] user=12.92% sys= 0.02% alloc=   15kb/s - prometheus-http-1-4
[004127] user= 0.92% sys= 0.37% alloc=  872kb/s - RMI TCP Connection(64)-172.26.34.166
[000081] user= 0.82% sys= 0.12% alloc=  393kb/s - read-hotness-tracker:1
[000047] user= 0.31% sys= 0.20% alloc=  3160b/s - ScheduledFastTasks:1
[000030] user= 0.31% sys= 0.03% alloc= 1039kb/s - OptionalTasks:1
[003238] user= 0.21% sys= 0.12% alloc=   24kb/s - Native-Transport-Requests-1
[003268] user= 0.10% sys= 0.06% alloc=   15kb/s - Thread-4
[000034] user= 0.10% sys=-0.02% alloc=     0b/s - LocalPool-Cleaner
[000029] user= 0.00% sys= 0.05% alloc=  1560b/s - PERIODIC-COMMIT-LOG-SYNCER
[000015] user= 0.00% sys= 0.05% alloc=  3122b/s - ScheduledTasks:1
[000033] user= 0.00% sys= 0.04% alloc=     0b/s - Reference-Reaper
[004128] user= 0.10% sys=-0.07% alloc=  3643b/s - JMX server connection timeout 4128
[003224] user= 0.00% sys= 0.03% alloc=   11kb/s - GossipStage:1
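To put a number on how much of the load comes from one group of threads, the per-thread lines of the ttop output can be summed by thread-name prefix. The following is only a minimal sketch: the function name sum_user_cpu is hypothetical, and it assumes the line layout shown in the output above (thread name after the last " - ").

```shell
#!/bin/sh
# Hypothetical helper: sum the user CPU of all threads whose name starts
# with PREFIX, reading sjk ttop output from standard input.
sum_user_cpu() {
    prefix="$1"
    # Expected line shape (taken from the output above):
    # [001422] user=30.05% sys= 0.09% alloc= 5407kb/s - prometheus-http-1-3
    awk -v prefix="$prefix" '
        /^\[[0-9]+\] user=/ {
            name = $0
            sub(/^.* - /, "", name)           # keep only the thread name
            if (index(name, prefix) == 1) {   # name starts with the prefix
                cpu = $0
                sub(/^.*user= */, "", cpu)    # drop everything before the number
                sub(/%.*$/, "", cpu)          # drop the percent sign and the rest
                total += cpu
            }
        }
        END { printf "%.2f\n", total }
    '
}
```

With a snapshot of the output saved to a file (ttop.txt here is just an example name), `sum_user_cpu prometheus-http < ttop.txt` prints 101.43 for the sample above, that is, almost all of the 104.62% user CPU.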

It was the Prometheus metrics collector. It looked like it was misconfigured somehow. I disabled it in /etc/cassandra/conf/cassandra-env.sh, and CPU load finally went from 400% to 0%.
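For illustration, here is roughly what disabling the collector might look like. This is a sketch under assumptions, not the exact change I made: it assumes the collector is loaded as a Java agent through a -javaagent: option whose path contains "prometheus", and the function name disable_prometheus_agent is hypothetical. Your cassandra-env.sh may enable it differently.

```shell
#!/bin/sh
# Hypothetical helper: print FILE with every not-yet-commented line that
# loads a Prometheus Java agent commented out. The original file is left
# untouched so the result can be reviewed before it is put in place.
disable_prometheus_agent() {
    # Skip lines that are already comments; prefix matching lines with "# ".
    sed '/^[[:space:]]*#/!s/^.*-javaagent:.*prometheus.*$/# &/' "$1"
}
```

A possible way to use it: run `disable_prometheus_agent /etc/cassandra/conf/cassandra-env.sh > /tmp/cassandra-env.sh.new`, compare the two files, then replace the original. Cassandra only reads this file at startup, so the node has to be restarted for the change to take effect.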
