
David

Learning Fullstack Observability: Metrics

(Originally published on Medium)

Introduction

According to RedHat:

Observability refers to the ability to monitor, measure, and understand the state of a system or application by examining its outputs, logs, and performance metrics.

OpenTelemetry is a vendor-neutral observability framework built on 3* main pillars:

  • Metrics
  • Traces
  • Logs

(* Profiles are an upcoming 4th pillar, but still in development)

This article is the first of 3 in the “Expense Tracker” series where I apply OpenTelemetry to my app, primarily with the Grafana stack.

To find out which REST API requests are the slowest, which services are hogging the most memory, and how saturated my web server's thread pool is, I need to track the numbers behind them. The tool fit for that, and the first in this observability series, is Prometheus.

For Expense Tracker, the exporters I’m interested in include the Windows exporter, the node exporter, and the MySQL exporter.

I’ll be going over the types of metrics used in Prometheus and how I apply them to create my Grafana dashboards.

Prometheus Metric Types

Metric types | Prometheus: the Prometheus project documentation for metric types

Gauges

How much disk space is available on the system?

The answer is a number that can go up or down as the server runs. This is one of the main metric types in Prometheus, and for the Windows exporter, that metric looks like this:

windows_logical_disk_free_bytes{volume="C:"} 2.6363297792e+10

The first part is the metric name. The part in curly braces is a label, and the number after it is the metric's current value. The Windows exporter exposes this and many other metrics at http://localhost:9182/metrics, and Prometheus is configured to scrape them at a defined interval. With enough scrapes, we get a pretty graph:

Prometheus native query page
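Because gauges are plain values, they can be combined directly in queries. For example, here's a sketch of a free-space percentage, assuming the windows_exporter's companion windows_logical_disk_size_bytes gauge:

```promql
# Percentage of free space on C:, computed from two windows_exporter gauges
100 * windows_logical_disk_free_bytes{volume="C:"}
    / windows_logical_disk_size_bytes{volume="C:"}
```

Dividing two gauges like this is handy for panels where a raw byte count is less readable than a percentage.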

Counters

How much load is my storage drive experiencing?

The answer lies with a counter. For Windows, the counter looks like this:

Example of a counter

Counters only ever go up (resetting to zero only when the process restarts), unlike gauges, but the key to using them is finding out how quickly they go up. That’s achieved with the rate function:

rate(windows_logical_disk_write_bytes_total{}[15s])

Rate applied to counter

The time window in square brackets needs to be larger than the scrape interval; in fact, rate needs at least two samples inside the window, so it should be at least double. In my case, I chose a scrape interval of 2.5 seconds, so a rate function with a window of 1 second would return no data.
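To make that concrete, here's a sketch using the same counter as above, with my 2.5-second scrape interval assumed:

```promql
# With a 2.5s scrape interval, a 1s window holds at most one sample,
# and rate() needs at least two samples to compute a per-second rate:
rate(windows_logical_disk_write_bytes_total[1s])    # returns no data

# A window of at least twice the scrape interval reliably holds two samples:
rate(windows_logical_disk_write_bytes_total[5s])
```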

Histograms

Histograms are more complex, but in general, they’re for aggregation, and the question they helped me answer while I was running my load tests is:

What latency do 95% of requests to GET /transactions stay under?

Simply put, the answer comes from a histogram quantile.

histogram_quantile(0.95, 
  sum by (le,http_route) 
    (rate(http_server_request_duration_seconds_bucket{http_route="/transactions",http_request_method="GET"}[15m]))
)

Histogram quantile graph

The 0.95 is the quantile (the 95th percentile), sum by (le, http_route) aggregates the buckets across other labels like request method while keeping the ones we care about, and rate converts the cumulative bucket counters into per-second rates over the chosen time window.

The graph caps off at 10 because of the default bucket boundaries of le specified by the OpenTelemetry semantic conventions for HTTP metrics:

[ 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10 ]

le is a special label meaning “less than or equal to”: each bucket counts every observation up to its boundary, forming a cumulative sum with the relevant metric.
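As an illustration, the cumulative buckets for a single series might look like this (the counts here are made up):

```promql
# Each bucket includes everything below its boundary;
# le="+Inf" always counts every observation.
http_server_request_duration_seconds_bucket{http_route="/transactions",le="0.25"} 180
http_server_request_duration_seconds_bucket{http_route="/transactions",le="0.5"}  240
http_server_request_duration_seconds_bucket{http_route="/transactions",le="10"}   299
http_server_request_duration_seconds_bucket{http_route="/transactions",le="+Inf"} 300
```

histogram_quantile interpolates within these buckets, which is why its answer can never exceed the largest finite boundary.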

This means that if most of the requests were significantly above 10 seconds, I’d have no way of knowing that from Prometheus.

My options include:

  • Rely on Tempo with a query that filters for traces above a minimum duration
{traceDuration > 10s && resource.service.name="expense_tracker_backend"}

Tempo time filter

  • Use Tempo Metrics Generator, although by default its buckets only go up to 16.384 seconds, much less than the extremes I saw in my load tests
  • Use Tempo in combination with exemplars to find samples of the worst offenders
  • Customize my bucket boundaries to include larger values

I talk about how my performance test reached such high latencies in a later article.

Grafana Dashboards

Using what I’ve learned in combination with the various collectors lets me build some fun dashboards:

MySQL

MySQL Basics Dashboard
Variables reference linked here.
Some of the panels include:

  • InnoDB memory usage to know if the server is running low
  • Bytes sent and received per second
  • Count of slow queries to know if any potentially inefficient queries are running

Node Exporter

Node Exporter Dashboard

Here I monitor the hardware, similar to what Linux's top or the Windows Task Manager shows.

RED Dashboard

Rate, Error, Duration dashboard

Where RED stands for:

  • Rate: number of requests per second; I show the overall rate and also break it down by endpoint
  • Errors: the error rate; I keep one panel for server errors and another for client errors
  • Duration: here I use p95 as discussed earlier, with a time window of 1 hour
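As a sketch, the three panels can be backed by queries like these (the metric and label names are assumed from the OpenTelemetry HTTP semantic conventions, as in the quantile query earlier):

```promql
# Rate: requests per second, broken down by endpoint
sum by (http_route) (rate(http_server_request_duration_seconds_count[5m]))

# Errors: fraction of requests that returned a 5xx status
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
  / sum(rate(http_server_request_duration_seconds_count[5m]))

# Duration: p95 latency over a 1-hour window
histogram_quantile(0.95,
  sum by (le) (rate(http_server_request_duration_seconds_bucket[1h])))
```

A matching query with `"4.."` covers the client-error panel.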

Setup

I’ll briefly cover the actual setup and configuration I used.

Metric scraping

The most basic configuration to get started with Prometheus is the scrape_configs block with the static_configs parameter:

scrape_configs:
 - job_name: windows_exporter
   scrape_interval: 2500ms
   static_configs:
    - targets: ['localhost:9182']
      labels:
        host_name: "LAPTOP-7H9JJDHB"

The config shown is for Windows, but it’s similar to my configuration for the other services.

OpenTelemetry

Enabled at the command line with:

--web.enable-otlp-receiver

More information here. It uses a push-based model, which differs from Prometheus's usual pull-based model. This is mainly for use with Alloy, Grafana’s distribution of the OpenTelemetry Collector. My configuration looks like this:

otelcol.exporter.otlphttp "prometheus_exporter_otlp" {
    client {
        endpoint = "http://localhost:9090/api/v1/otlp"
    }
}

otelcol.receiver.otlp "otlp_receiver" {
    grpc {
    }

    http {
    }
    output {
        metrics = [otelcol.exporter.otlphttp.prometheus_exporter_otlp.input]
    }
}

And my resulting Alloy graph is this:

Alloy OpenTelemetry graph

Alert icon image by juicy_fish on Freepik

Alerting

CPU, disk activity, and RAM usage are all much more volatile metrics than storage (disk-space) usage, so I grouped them separately, set different time intervals for how often alerts are sent, and then set the notifier to email.

The basic setup looks like this:

Prometheus Alerting Example
And a sample alert looks like this:

Prometheus Email Example
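As a rough sketch, the volatile group could be expressed as Prometheus-style alerting rules like this (the expression and thresholds here are my own assumptions, not taken from my actual config):

```yaml
groups:
  - name: volatile            # CPU, disk activity, RAM: evaluated frequently
    interval: 30s
    rules:
      - alert: HighCpuUsage
        # 1 minus the idle fraction gives overall CPU utilization
        # (windows_cpu_time_total is exposed by the Windows exporter)
        expr: 1 - avg(rate(windows_cpu_time_total{mode="idle"}[5m])) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% for 5 minutes"
```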

Conclusion

So that’s pretty much it. This is the first of the 3 observability articles; the next one will cover how I gather the logs from my various services into one central place using Loki.

Please feel free to share your thoughts and suggestions in the comments section.

Thank you for your time.
