
David

Learning Fullstack Observability: Metrics

(Originally published on Medium)

Introduction

According to RedHat:

Observability refers to the ability to monitor, measure, and understand the state of a system or application by examining its outputs, logs, and performance metrics.

OpenTelemetry is a vendor-neutral observability framework built on 3* main pillars:

  • Metrics
  • Traces
  • Logs

(* Profiles are an upcoming 4th pillar, but still in development)

This article is the first of 3 in the “Expense Tracker” series where I apply OpenTelemetry to my app, primarily with the Grafana stack.

To find out which REST API requests are the slowest, which services are hogging the most memory, and how saturated my web server's thread pool is, I need to track the numbers behind them. The tool fit for that, and the first in this observability series, is Prometheus.

For Expense Tracker, the exporters I’m interested in include the Windows exporter, the node exporter, and the MySQL exporter.

I’ll be going over the types of metrics used in Prometheus and how I apply them to create my Grafana dashboards.

Prometheus Metric Types

Metric types | Prometheus: the Prometheus project documentation for metric types

Gauges

How much disk space is available on the system?

The answer is a number that can go up or down as the server runs. This is one of the main metric types in Prometheus, and for the Windows exporter, that metric looks like this:

windows_logical_disk_free_bytes{volume="C:"} 2.6363297792e+10

The first part is the metric name. The part in curly braces is a label, and the number after it is the metric's current value. The Windows exporter exposes this and many other metrics at http://localhost:9182/metrics, and Prometheus is configured to scrape them at a defined interval. With enough scrapes, we get a pretty graph:

Prometheus native query page
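Because gauges are plain values, they can be combined directly in queries. For example, here's a sketch of a free-space percentage, assuming the windows_exporter's companion windows_logical_disk_size_bytes gauge:

```promql
# Percentage of free space on C:, computed from two windows_exporter gauges
100 * windows_logical_disk_free_bytes{volume="C:"}
    / windows_logical_disk_size_bytes{volume="C:"}
```

Dividing two gauges like this is handy for panels where a raw byte count is less readable than a percentage.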

Counters

How much load is my storage drive experiencing?

The answer lies with a counter. For Windows, the counter looks like this:

Example of a counter

Counters only ever go up (resetting to zero only when the process restarts), unlike gauges, but the key to using them is finding out how quickly they go up. That’s achieved with the rate function:

rate(windows_logical_disk_write_bytes_total{}[15s])

Rate applied to counter

The time window in square brackets needs to be larger than the scrape interval; in fact, rate needs at least two samples inside the window, so it should be at least double. In my case, I chose a scrape interval of 2.5 seconds, so a rate function with a window of 1 second would return no data.
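To make that concrete, here's a sketch using the same counter as above, with my 2.5-second scrape interval assumed:

```promql
# With a 2.5s scrape interval, a 1s window holds at most one sample,
# and rate() needs at least two samples to compute a per-second rate:
rate(windows_logical_disk_write_bytes_total[1s])    # returns no data

# A window of at least twice the scrape interval reliably holds two samples:
rate(windows_logical_disk_write_bytes_total[5s])
```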

Histograms

Histograms are more complex, but in general, they’re for aggregation, and the question they helped me answer while I was running my load tests is:

What latency do 95% of requests to GET /transactions stay under?

Simply put, the answer comes from a histogram quantile.

histogram_quantile(0.95, 
  sum by (le,http_route) 
    (rate(http_server_request_duration_seconds_bucket{http_route="/transactions",http_request_method="GET"}[15m]))
)

Histogram quantile graph

The 0.95 is the quantile (the 95th percentile), sum by (le, http_route) aggregates the buckets across other labels like request method while keeping the ones we care about, and rate converts the cumulative bucket counters into per-second rates over the chosen time window.

The graph caps off at 10 because of the default bucket boundaries of le specified by the OpenTelemetry semantic conventions for HTTP metrics:

[ 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10 ]

le is a special label meaning “less than or equal to”: each bucket counts every observation up to its boundary, forming a cumulative sum with the relevant metric.
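As an illustration, the cumulative buckets for a single series might look like this (the counts here are made up):

```promql
# Each bucket includes everything below its boundary;
# le="+Inf" always counts every observation.
http_server_request_duration_seconds_bucket{http_route="/transactions",le="0.25"} 180
http_server_request_duration_seconds_bucket{http_route="/transactions",le="0.5"}  240
http_server_request_duration_seconds_bucket{http_route="/transactions",le="10"}   299
http_server_request_duration_seconds_bucket{http_route="/transactions",le="+Inf"} 300
```

histogram_quantile interpolates within these buckets, which is why its answer can never exceed the largest finite boundary.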

This means that if most of the requests were significantly above 10 seconds, I’d have no way of knowing that from Prometheus.

My options include:

  • Rely on Tempo with a query that filters for traces above a minimum duration
{traceDuration > 10s && resource.service.name="expense_tracker_backend"}

Tempo time filter

  • Use Tempo Metrics Generator, although by default its buckets only go up to 16.384 seconds, much less than the extremes I saw in my load tests
  • Use Tempo in combination with exemplars to find samples of the worst offenders
  • Customize my bucket boundaries to include larger values

I talk about how my performance test reached such high latencies in a later article.

Grafana Dashboards

Using what I’ve learned in combination with the various collectors lets me build some fun dashboards:

MySQL

MySQL Basics Dashboard
Variables reference linked here.
Some of the panels include:

  • InnoDB memory usage to know if the server is running low
  • Bytes sent and received per second
  • Count of slow queries to know if any potentially inefficient queries are running

Node Exporter

Node Exporter Dashboard

Here I monitor the hardware, similar to what Linux's top or the Windows Task Manager shows.

RED Dashboard

Rate, Error, Duration dashboard

Where RED stands for:

  • Rate: number of requests per second; I show the overall rate and also break it down by endpoint
  • Errors: the error rate; I keep one panel for server errors and another for client errors
  • Duration: here I use p95 as discussed earlier, with a time window of 1 hour
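As a sketch, the three panels can be backed by queries like these (the metric and label names are assumed from the OpenTelemetry HTTP semantic conventions, as in the quantile query earlier):

```promql
# Rate: requests per second, broken down by endpoint
sum by (http_route) (rate(http_server_request_duration_seconds_count[5m]))

# Errors: fraction of requests that returned a 5xx status
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
  / sum(rate(http_server_request_duration_seconds_count[5m]))

# Duration: p95 latency over a 1-hour window
histogram_quantile(0.95,
  sum by (le) (rate(http_server_request_duration_seconds_bucket[1h])))
```

A matching query with `"4.."` covers the client-error panel.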

Setup

I’ll briefly cover the actual setup and configuration I used.

Metric scraping

The most basic configuration to get started with Prometheus is the scrape_configs block with the static_configs parameter:

scrape_configs:
 - job_name: windows_exporter
   scrape_interval: 2500ms
   static_configs:
    - targets: ['localhost:9182']
      labels:
        host_name: "LAPTOP-7H9JJDHB"

The config shown is for Windows, but it’s similar to my configuration for the other services.

OpenTelemetry

Enabled at the command line with:

--web.enable-otlp-receiver

More information here. It uses a push-based model, which differs from Prometheus's usual pull-based model. This is mainly for use with Alloy, Grafana’s distribution of the OpenTelemetry Collector. My configuration looks like this:

otelcol.exporter.otlphttp "prometheus_exporter_otlp" {
    client {
        endpoint = "http://localhost:9090/api/v1/otlp"
    }
}

otelcol.receiver.otlp "otlp_receiver" {
    grpc {
    }

    http {
    }
    output {
        metrics = [otelcol.exporter.otlphttp.prometheus_exporter_otlp.input]
    }
}

And my resulting Alloy graph is this:

Alloy OpenTelemetry graph

Alert icon image by juicy_fish on Freepik

Alerting

CPU, disk activity, and RAM usage are all much more volatile metrics than storage (disk-space) usage, so I grouped them separately, set different time intervals for how often alerts are sent, and then set the notifier to email.

The basic setup looks like this:

Prometheus Alerting Example
And a sample alert looks like this:

Prometheus Email Example
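As a rough sketch, the volatile group could be expressed as Prometheus-style alerting rules like this (the expression and thresholds here are my own assumptions, not taken from my actual config):

```yaml
groups:
  - name: volatile            # CPU, disk activity, RAM: evaluated frequently
    interval: 30s
    rules:
      - alert: HighCpuUsage
        # 1 minus the idle fraction gives overall CPU utilization
        # (windows_cpu_time_total is exposed by the Windows exporter)
        expr: 1 - avg(rate(windows_cpu_time_total{mode="idle"}[5m])) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% for 5 minutes"
```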

Conclusion

So that’s pretty much it. This is the first of the 3 observability articles; the next one will cover how I gather the logs from my various services into one central place using Loki.

Please feel free to share your thoughts and suggestions in the comments section.

Thank you for your time.
