Ahsan Nabi Dar

Anomaly Alerts for monitoring using Grafana and Prometheus

I have been working on a side project that started during the pandemic: building a DoH (DNS over HTTPS) service with ad and tracking blocking. Since then it has grown to be hosted in 7 locations. I collect logs, data and metrics with Prometheus from this globally distributed stack onto a remote server, which gives me a consolidated view of all things observability using Grafana, while recording all my DNS traffic and streaming it using Elixir and Phoenix LiveView via Kafka.

It is now the most critical service in my daily routine. Any downtime or performance bottleneck degrades my browsing experience or, worse, takes me off the grid. What started as a side project has, over the last few years, become a core development activity and a learning pad for me.

As I relied more and more on the service's uptime, using it on my mobile, tablet and desktop, the need to monitor and observe the systems grew, along with the need for guarantees over SLAs and SLOs.

As usual, I started off by setting up threshold alerts, and the alerts soon became too noisy, as all of these nodes are very small, ranging from 1-2GB of memory with 1 vCPU. One of the objectives of this exercise was also to push cheap VPSes and see how they perform. The stack currently runs 13 services per node on limited hardware, which requires careful consideration and observation.

mark node

Most of the stress was on the node itself due to its limited capacity, and the alerts were for Load, CPU, Memory and Disk at the node level, not at the service level.

To overcome the noise, better understand the usage pattern and attend to service issues, I started looking into anomaly detection and came across a fantastic blog post, "Grafana Prometheus: Detecting anomalies in time series". It does a great job of explaining that "the 3-sigma rule states that approximately all our “normal” data should be within 3 standard deviations of the average value of your data."
This can be stated mathematically as

P(μ − 3σ ≤ x ≤ μ + 3σ) ≈ 99.7%

i.e. a data point is treated as anomalous when |x − μ| > 3σ, where μ is the mean and σ is the standard deviation.
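In PromQL this pattern becomes a z-score: the difference between a short-window average and a one-day average, divided by the one-day standard deviation. Here is a minimal sketch of that pattern, where some_metric is a placeholder for any gauge-style series and the 10m short window is an assumption (the queries below use Grafana's $__rate_interval instead):

# z-score: how many daily standard deviations the recent average
# deviates from the daily baseline
(
  avg_over_time(some_metric[10m])    # recent behaviour, short window
  - avg_over_time(some_metric[1d])   # historical baseline, one day
)
/ stddev_over_time(some_metric[1d])  # spread over the same day

Per the 3-sigma rule, a value of this ratio outside the range -3 to 3 is treated as anomalous.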

Based on this, I formulated the PromQL queries below for CPU, Memory, Load and Disk anomaly detection. Overall, these expressions can be used to identify instances where the current idle CPU time, available memory, available disk space or 15-minute load average deviates significantly from its historical average, potentially indicating an anomaly or unusual behaviour in resource usage.

CPU

(
  avg_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[$__rate_interval])
  - avg_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[1d])
)
/ stddev_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[1d])

Here's a breakdown of the query:

  1. avg_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[$__rate_interval])
    • This part calculates the average idle CPU time over a time range specified by $__rate_interval.
    • node_cpu_seconds_total is a metric that represents the total number of seconds the CPU has spent in various states. In this case, we're interested in the "idle" state, which is when the CPU is not doing any work.
    • instance, job, and mode are labels used to filter the metric to a specific instance of the node-exporter job in idle mode.
  2. - avg_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[1d])
    • This part subtracts the average idle CPU time over the last day from the average calculated in the previous step.
    • This helps in understanding how the current average compares to historical averages.
  3. /stddev_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[1d])
    • This part divides the result from step 2 by the standard deviation of the idle CPU time over the last day.
    • The standard deviation measures the spread or dispersion of a set of data points. In this context, it helps in understanding how much the current value deviates from the historical average.
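The breakdown above describes a dashboard query; to use it as an alert condition, one option (a sketch, not taken from the original post) is to wrap it in abs() and compare against 3, swapping Grafana's $__rate_interval for a fixed short window such as 10m:

# fire when idle CPU time behaviour is more than 3 standard deviations
# away from its one-day average, in either direction
abs(
  (
    avg_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[10m])
    - avg_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[1d])
  )
  / stddev_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[1d])
) > 3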

Memory

(
  avg_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[$__rate_interval])
  - avg_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[1d])
)
/ stddev_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[1d])

Here's a breakdown of the query:

  1. avg_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[$__rate_interval])
    • This part calculates the average available memory over a time range specified by $__rate_interval.
    • node_memory_MemAvailable_bytes is a metric representing the amount of available memory on the system.
  2. -avg_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[1d])
    • This part subtracts the average available memory over the last day from the average calculated in the previous step. This helps in understanding how the current average compares to historical averages.
  3. /(stddev_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[1d]))
    • This part divides the result from step 2 by the standard deviation of the available memory over the last day.
    • The standard deviation measures the spread or dispersion of a set of data points. In this context, it helps in understanding how much the current available memory value deviates from the historical average.
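Since the worry here is available memory dropping below its usual range rather than rising above it, the alert condition can be one-sided. A sketch under the same label assumptions, again using a fixed 10m window in place of $__rate_interval, firing only when available memory sits more than 3 standard deviations below the daily average:

# fire only when available memory is abnormally low, not abnormally high
(
  avg_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[10m])
  - avg_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[1d])
)
/ stddev_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[1d])
< -3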

Load

(
  avg_over_time(node_load15{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[$__rate_interval])
  - avg_over_time(node_load15{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[1d])
)
/ stddev_over_time(node_load15{instance="mark-00-sin",job="node-exporter-mark-00-sin"}[1d])

Here's a breakdown of the query:

  1. avg_over_time(node_load15{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[$__rate_interval])
    • This part calculates the average 15-minute load average over a time range specified by $__rate_interval.
    • node_load15 is a metric representing the 15-minute load average on the system.
  2. - avg_over_time(node_load15{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[1d])
    • This part subtracts the average 15-minute load average over the last day from the average calculated in the previous step. This helps in understanding how the current average compares to historical averages.
  3. /stddev_over_time(node_load15{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[1d])
    • This part divides the result from step 2 by the standard deviation of the 15-minute load average over the last day.
    • The standard deviation measures the spread or dispersion of a set of data points. In this context, it helps in understanding how much the current value deviates from the historical average.
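For load, the interesting direction is upwards, so the condition can compare against +3 only. Because the stack spans several nodes, the instance filter can also be dropped so one expression covers all of them; the job=~"node-exporter-.*" regex below is an assumption based on the job names used above and would need to match your actual job naming:

# one series per instance; fires for any node whose 15-minute load
# rises more than 3 standard deviations above its own daily average
(
  avg_over_time(node_load15{job=~"node-exporter-.*"}[10m])
  - avg_over_time(node_load15{job=~"node-exporter-.*"}[1d])
)
/ stddev_over_time(node_load15{job=~"node-exporter-.*"}[1d])
> 3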

Disk

(
  avg_over_time(node_filesystem_avail_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin",device="/dev/sda"}[$__rate_interval])
  - avg_over_time(node_filesystem_avail_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin",device="/dev/sda"}[1d])
)
/ stddev_over_time(node_filesystem_avail_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin",device="/dev/sda"}[1d])

Here's a breakdown of the query:

  1. avg_over_time(node_filesystem_avail_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin", device="/dev/sda"}[$__rate_interval])
    • This part calculates the average available disk space on the specified device over a time range specified by $__rate_interval.
    • node_filesystem_avail_bytes is a metric representing the available bytes on a filesystem.
  2. - avg_over_time(node_filesystem_avail_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin", device="/dev/sda"}[1d])
    • This part subtracts the average available disk space over the last day from the average calculated in the previous step. This helps in understanding how the current average compares to historical averages.
  3. /(stddev_over_time(node_filesystem_avail_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin", device="/dev/sda"}[1d]))
    • This part divides the result from step 2 by the standard deviation of the available disk space over the last day.
    • The standard deviation measures the spread or dispersion of a set of data points. In this context, it helps in understanding how much the current available disk space value deviates from the historical average.
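As the disk examples later in the post show, both an unexpected rise and an unexpected drop in available space can signal trouble (logs not being written, or something filling the disk), so a two-sided condition makes sense here. A sketch with the same labels and a fixed 10m window standing in for $__rate_interval:

# fire when available disk space deviates more than 3 standard deviations
# from its one-day average, whether it rises or falls
abs(
  (
    avg_over_time(node_filesystem_avail_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin",device="/dev/sda"}[10m])
    - avg_over_time(node_filesystem_avail_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin",device="/dev/sda"}[1d])
  )
  / stddev_over_time(node_filesystem_avail_bytes{instance="mark-00-sin",job="node-exporter-mark-00-sin",device="/dev/sda"}[1d])
) > 3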

That's a lot of PromQL theory, but does it work any better than threshold alerts and bring some sanity to alerting?

Here are some comparisons between the threshold-based and anomaly-based alerts, showing how they provide better insights and complement each other.

This is a memory alert that is firing because it has breached the threshold value.

Mem alert 00

Mem alert 00 output

At the same time the anomaly alert is normal, indicating that the usage is within the expected range.

Mem Anomaly 00

Mem Anomaly 00 output

Here is an example where the Load threshold and anomaly alerts both fired, which shows that they don't work exclusively and complement each other.

There is only a Load-based anomaly alert and no Memory anomaly on the node,

Alerts 00

Load Anomaly 02

Mem Anomaly 02
but according to the threshold alerts, both Load and Memory have breached their thresholds.
Load Alert 01

Mem Alert 01

This gives you good insight when there is no anomaly alert but a noisy threshold alert: either you have set the threshold too low or your allocation is not optimal. Anomaly and threshold alerts should work in tandem to give you confidence.

Another example is a Disk anomaly indicating a drop in disk usage. With a threshold you would probably never trigger an alert until usage rises or falls beyond a fixed value, but the anomaly alert indicates there could be a problem, such as your application failing to write logs or degrading hardware.

Disk Alert 00

To end with one more Load-based example: the threshold alert remains normal while the anomaly alert gets triggered, as there is a sudden spike in load but the value remains within the threshold. This should be investigated and observed, as it can indicate unexpected traffic or a long-running zombie process consuming resources.

Load Alert 03

Load Anomaly 03

As always, the effectiveness of anomaly detection depends on the quality and consistency of the data being collected, in my case using Prometheus, and you may need to adjust thresholds or use more advanced techniques based on your specific use case and system characteristics.
