(Originally published on Medium)
Introduction
According to Red Hat:
"Observability refers to the ability to monitor, measure, and understand the state of a system or application by examining its outputs, logs, and performance metrics."
OpenTelemetry is a vendor-neutral framework for observability, and is made of 3* main pillars:
- Metrics
- Traces
- Logs
(* Profiles are an upcoming 4th pillar, but still in development)
In the “Expense Tracker” series, this article will be the first of 3 where I apply OpenTelemetry to my app, primarily with the Grafana stack:
To find out which REST API requests are the slowest, which services are hogging the most memory, and how saturated my web server's thread pool is, I need to track the numbers behind them, and the tool fit for that job is the first in this observability series: Prometheus.
For Expense Tracker, the exporters I’m interested in include:
- Node exporter: for my Linux VMs
- Windows exporter: for my host machine
- MySQL exporter: for my database
- NGINX exporter
- OpenTelemetry Java Agent: for the backend. I could have also used Spring Actuator or the JMX exporter, but the agent is simpler for this project since it emits all 3 types of telemetry signals.
I’ll be going over the types of metrics used in Prometheus and how I apply them to create my Grafana dashboards.
Prometheus Metric Types
Gauges
How much disk space is available on the system?
The answer is a number that can go up or down as the server runs. This is one of the main metric types in Prometheus, and for the Windows exporter, that metric looks like this:
windows_logical_disk_free_bytes{volume="C:"} 2.6363297792e+10
The first part is the metric name. The part in curly braces is a label, and the number at the end is the metric's value. The Windows exporter exposes this and many other metrics at http://localhost:9182/metrics, Prometheus scrapes them at a defined interval, and with enough scrapes, we get a pretty graph:
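Gauges can also be combined directly in PromQL. As a sketch, pairing the free-bytes gauge with windows_logical_disk_size_bytes (another metric the Windows exporter exposes) turns it into a percentage per volume:

```promql
# Percentage of free space per volume (a hypothetical dashboard query)
100 * windows_logical_disk_free_bytes / windows_logical_disk_size_bytes
```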
Counters
How much load is my storage drive experiencing?
The answer lies with a counter. For Windows, that counter is windows_logical_disk_write_bytes_total. Counters only ever go up, unlike gauges, and the key to using them is finding out how quickly they go up. That's achieved with the rate function:
rate(windows_logical_disk_write_bytes_total{}[15s])
The time window in square brackets needs to cover at least two scrapes, since rate needs two samples to compute a slope. In my case, I chose a scrape interval of 2.5 seconds, so a rate function with a window of 1 second would return no data.
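A related sketch: where rate gives a per-second speed, the increase function answers the cumulative form of the same question, e.g. total bytes written to a volume over the last hour (assuming the same counter as above):

```promql
# Total bytes written to C: over the last hour
increase(windows_logical_disk_write_bytes_total{volume="C:"}[1h])
```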
Histograms
Histograms are more complex, but in general, they’re for aggregation, and the question they helped me answer while I was running my load tests is:
What latency are 95% of requests to GET /transactions completing within?
And simply put, the answer is gotten with a histogram quantile.
histogram_quantile(0.95,
  sum by (le, http_route) (
    rate(http_server_request_duration_seconds_bucket{http_route="/transactions", http_request_method="GET"}[15m])
  )
)
The 0.95 is the quantile itself, sum by aggregates the buckets across labels like request method and route, and rate turns the cumulative bucket counters into per-second increases over the chosen time window.
The graph caps off at 10 because of the default bucket boundaries of le specified by the OpenTelemetry semantic conventions for HTTP metrics:
[ 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10 ]
le is a special label meaning "less than or equal to": each bucket counts every observation at or below its boundary, which forms a cumulative sum across the buckets.
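To illustrate with hypothetical numbers, the raw bucket series for one route might look like this, where each count includes everything from the buckets below it:

```
http_server_request_duration_seconds_bucket{le="0.1"}  180
http_server_request_duration_seconds_bucket{le="0.25"} 195
http_server_request_duration_seconds_bucket{le="0.5"}  199
http_server_request_duration_seconds_bucket{le="+Inf"} 200
```

Here, 195 of the 200 total requests finished in 0.25 seconds or less, so the p95 falls in the 0.25 bucket.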
This means that if most of the requests were significantly above 10 seconds, I’d have no way of knowing that from Prometheus.
My options include:
- Relying on Tempo with a query to filter out the minimum-duration traces
{traceDuration > 10s && resource.service.name="expense_tracker_backend"}
- Using the Tempo metrics generator, although by default its buckets only go up to 16.384 seconds, which was much less than what I was experiencing at the extremes of my load tests
- Using Tempo in combination with exemplars to find samples of the worst offenders
- Customizing my bucket boundaries to include larger values
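For that last option, one route I'm aware of (assuming a reasonably recent OpenTelemetry Java agent that follows the SDK environment-variable spec) is switching the OTLP exporter's default histogram aggregation to exponential histograms, whose bucket boundaries scale automatically to cover outliers instead of capping at a fixed value:

```shell
# Ask the OTLP metrics exporter to use base-2 exponential histograms
# instead of the fixed explicit-bucket boundaries shown above
export OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION=base2_exponential_bucket_histogram
```

Note that Prometheus then needs native-histogram support enabled to ingest them.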
I talk about how my performance test reached such high latencies in a later article.
Grafana Dashboards
Using what I’ve learned in combination with the various collectors lets me build some fun dashboards:
MySQL

Variables reference linked here.
Some of the panels include:
- InnoDB memory usage to know if the server is running low
- Bytes sent and received per second
- Count of slow queries to know if any potentially inefficient queries are running
Node Exporter
Here I monitor the hardware, similar to what's shown in Linux's top or the Windows Task Manager.
RED Dashboard
Where RED stands for:
- Rate: the number of requests per second; I show the overall rate and also break it down by endpoint
- Errors: the error rate; I chose to keep one panel for server errors and another for client errors
- Duration: the p95 latency discussed earlier, with a time window of 1 hour
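As a sketch of the Errors panel, a server-error ratio can be derived from the same duration histogram's _count series (assuming the OpenTelemetry semantic-convention label http_response_status_code is present on the metric):

```promql
# Fraction of requests answered with a 5xx status over the last 5 minutes
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))
```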
Setup
I'll briefly cover the actual setup and configuration I used.
Metric scraping
The most basic configuration to get started with Prometheus is the scrape_configs block with the static_configs parameter:
scrape_configs:
  - job_name: windows_exporter
    scrape_interval: 2500ms
    static_configs:
      - targets: ['localhost:9182']
        labels:
          host_name: "LAPTOP-7H9JJDHB"
The config shown is for Windows, but it's similar to my configuration for the other services.
OpenTelemetry
Enabled at the command line with:
--web.enable-otlp-receiver
More information here. It uses a push-based model, which differs from Prometheus' main pull-based model. This is mainly for use with Alloy, Grafana's distribution of the OpenTelemetry Collector. My configuration looks like this:
otelcol.exporter.otlphttp "prometheus_exporter_otlp" {
  client {
    endpoint = "http://localhost:9090/api/v1/otlp"
  }
}

otelcol.receiver.otlp "otlp_receiver" {
  grpc {
  }
  http {
  }
  output {
    metrics = [otelcol.exporter.otlphttp.prometheus_exporter_otlp.input]
  }
}
And my resulting Alloy graph is this:

Image by juicy_fish on Freepik
Alerting
CPU, disk, and RAM usage are all much more volatile metrics than storage usage, so I grouped them separately with different time intervals for alerts to be sent, and then set the notifier to email.
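As a rough equivalent of the volatile group in Prometheus's own rule-file syntax (a sketch only; my actual alerts live in Grafana, and the threshold here is illustrative):

```yaml
groups:
  - name: volatile_resources
    interval: 1m
    rules:
      - alert: HighCpuUsage
        # Fires when average non-idle CPU time stays above 90% for 10 minutes
        expr: 100 - 100 * avg(rate(windows_cpu_time_total{mode="idle"}[5m])) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% on {{ $labels.host_name }}"
```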
The basic setup looks like this:

And a sample alert looks like this:
Conclusion
So that’s pretty much it. This is the first of the 3 Observability articles, and the next one will be about how I gather all the logs from my various services into a central place using Loki.
Please feel free to discuss your thoughts and suggestions in the comments section.
Thank you for your time!