DEV Community

Robert Nemet

Posted on • Originally published at rnemet.dev

How to Design Metrics With Prometheus Metric Types: the USE Method

This is the third part of a series about designing metrics for event-driven systems. You can check the first part and the second part of this series before proceeding.

The first part covered the general principles of designing metrics. The second part explained Prometheus metric types and applied them using the RED method. In this article, I'll explain the USE method with Prometheus, followed by a short discussion of the Four Golden Signals and a conclusion covering all the methods.

Let's go...

The USE Method

The USE method by Brendan Gregg is a set of rules for designing metrics, mainly used for systems that are not directly exposed to users, such as databases, message brokers, and streaming platforms.
Its key metrics are:

  • Utilization - the proportion of time or capacity in which a resource is busy
  • Errors - the number of error events over a period of time
  • Saturation - the degree to which a resource has extra work that it cannot handle right away. That extra work has to wait in a queue or be dropped.

Implementation

To keep things simple and close to daily work, I'll demonstrate the USE method by observing CPU, memory, and network. The examples use docker-compose, Prometheus, and Grafana. To get metrics from the system, I'm using the node-exporter. The complete example is in my GitHub repo.

CPU Utilization

CPU utilization is the percentage of time the CPU is busy. The node-exporter provides the node_cpu_seconds_total metric. This metric is a counter that counts the number of seconds the CPU has spent in each mode. One of the modes is idle, which is when the CPU is not busy.

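Over a time window, say 1m, rate() gives the per-second increase of the idle counter for each core, which is the fraction of that window the core spent idle. Averaging across cores gives the overall idle fraction, and subtracting it from 1 gives the CPU utilization as a value between 0 and 1: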

1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))

It is the same principle as in the RED method. We use counters, observe the rate of change, and then calculate the average.
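
To make the arithmetic behind the query concrete, here is a minimal Python sketch of the same calculation. It is only an illustration: the sample values, the two-core setup, and the function names are made up, and the real rate() function in Prometheus also handles counter resets and extrapolation, which this sketch ignores.

```python
# Hypothetical samples of node_cpu_seconds_total{mode="idle"} for two cores,
# taken 60 seconds apart. The numbers are invented for illustration only.
WINDOW_SECONDS = 60.0

idle_start = {"cpu0": 1000.0, "cpu1": 2000.0}
idle_end = {"cpu0": 1045.0, "cpu1": 2030.0}


def idle_rate(start, end, window):
    """Per-second increase of the idle counter, i.e. the fraction of time spent idle."""
    return (end - start) / window


def cpu_utilization(start_samples, end_samples, window):
    """Average the idle fraction across cores and subtract it from 1."""
    rates = [
        idle_rate(start_samples[cpu], end_samples[cpu], window)
        for cpu in start_samples
    ]
    return 1.0 - sum(rates) / len(rates)


utilization = cpu_utilization(idle_start, idle_end, WINDOW_SECONDS)
print(f"CPU utilization: {utilization:.1%}")  # prints "CPU utilization: 37.5%"
```

In this made-up case, cpu0 was idle for 45 of the 60 seconds and cpu1 for 30, so the average idle fraction is 0.625 and the utilization is 0.375.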

If you are interested, continue to the rest on my blog.
