Josephat Macharia

Posted on Apr 27, 2021 • Edited on Oct 31, 2021

How Monitoring Works: Prometheus, Grafana and StatsD

#devops #monitoring #programming

What is Monitoring?

The periodic tracking (for example, daily, weekly, monthly, quarterly, or annually) of an activity’s progress by systematically gathering and analyzing data and information.

Why monitor?

Monitoring helps to answer the following questions:

What is happening right now?
When did things change, and what changed? (Did it go up or down? Did it change when we did this deploy?)
When did it happen?
How often does it happen? For example, how many times a second or a day do people use this feature, or how many queries am I sending to the database?
How long did it last?
Enhance reliability: software bugs can be prevented via:
- testing
- measuring
- monitoring
- analyzing.

What is a metric?

A measurement of some property of a specific target, recorded over time (for example, the request latency of a service).

Things to monitor?

request latency
request count
number of exceptions
cpu utilization

How to Monitor?

We could use traditional, hand-rolled methods, but why reinvent the wheel? Use monitoring tools.

What do monitoring tools offer?

collection
- Describes how the measurements are taken.
storage
- how to get the metrics from where you measured them into something that will store them long term (usually a time series database).
- This storage also needs to be capable of fetching the data quickly for analysis, graphing, alerting, etc.
graphing
- visualizing and analyzing metrics collected.
aggregation windows
- how long you can keep metrics, and at what resolution, before you run out of space or the storage becomes slow.
alerting
- channels used to notify users when things happen.

What are some of the common measurement/metric types?

Counter
- Use this for situations where you want to know “how many times has x happened”.
- for example, the total number of HTTP requests, or the total number of bytes sent over HTTP
Gauge
- A representation of a metric that can go both up and down.
- Use this to answer “what is the current value of x”.
- for example, the number of users currently logged in.
Histogram
- This represents observed values sorted into distinct buckets.
- Think of this as a mechanism to track “how long something took” or “how big something was”.
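
To make the three types concrete, here is a minimal sketch in plain Python. It is only an illustration: the class names, method names and bucket bounds are assumptions made up for this example, not the API of StatsD, Prometheus or any client library. The histogram keeps a simple per-bucket count.

```python
import bisect


class Counter:
    """Only ever goes up: "how many times has x happened"."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Can go up and down: "what is the current value of x"."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket: "how long" or "how big"."""

    def __init__(self, upper_bounds):
        # Hypothetical bucket upper bounds (e.g. seconds). The extra
        # last slot catches everything above the largest bound.
        self.upper_bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value
```

For example, a `Histogram([0.1, 0.5, 1.0])` would put a 0.3 second request into the second bucket and a 2 second request into the overflow slot.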

Monitoring patterns for reporting metrics

Pull model
- the monitoring system "scrapes" the application at a predefined HTTP endpoint.
Push model
- the application sends the data to the monitoring system.

An example of a monitoring system working in the pull model is Prometheus. StatsD is an example of a monitoring system where the application pushes the metrics to the system.

Monitoring tools

We will focus on the following:

StatsD
Prometheus

What is StatsD?

A system for instrumenting applications with metrics
StatsD is near real time and keeps little history itself
a funnel that collects data and forwards it

Why was it created?

DevOps engineers could not see the internals of an application, only its CPU utilization and network I/O.

How does it work?

StatsD accepts measurements from all over your network over UDP.
It aggregates them into 10-second chunks.
It sends them off to somewhere that will store the data, e.g. Graphite.
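
The receive, aggregate and flush cycle can be sketched roughly like this. It is a simplified Python illustration and not the real StatsD daemon: the `Aggregator` class and its method names are assumptions, it only handles counters, and the actual timer that calls `flush` every interval is left out.

```python
from collections import defaultdict


class Aggregator:
    """Sums counter increments between flushes (hypothetical names)."""

    def __init__(self, flush_interval=10):
        self.flush_interval = flush_interval  # seconds, as in the text
        self.counters = defaultdict(float)

    def receive(self, packet):
        # One measurement per packet, e.g. b"page.views:1|c".
        name, rest = packet.decode("utf-8").split(":", 1)
        value, metric_type = rest.split("|")[:2]
        if metric_type == "c":
            self.counters[name] += float(value)

    def flush(self):
        # Hand the aggregated chunk to the backend and start over.
        chunk = dict(self.counters)
        self.counters.clear()
        return chunk
```

Every call to `flush` returns one aggregated chunk, which is what would be shipped to a backend such as Graphite.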

What are the components of StatsD?

API Client --> this integrates with your actual application.
UDP Protocol --> the API client is just a thin wrapper over UDP.
Daemon --> this daemon runs alongside your application, usually on the same host.
Metrics collection: you instrument your application with metrics, which are sent as UDP packets to a local daemon. The local daemon aggregates these packets and sends them in batches to a backend.

What metric types are available?

counters - how many times something happened
timers - how long something took, e.g. on average
gauges - a current value, e.g. % completion
sets - keep track of unique values
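
On the wire, each of these types is a short text line of the form `name:value|type` sent in a UDP packet. Below is a minimal Python sketch of a client. The metric names are made up, and the daemon address (localhost, port 8125, the usual StatsD default) is an assumption about your setup.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("utf-8")


def send_metric(payload, host="127.0.0.1", port=8125):
    # UDP is fire-and-forget: no connection, no reply expected.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names, one per StatsD type.
EXAMPLES = [
    format_metric("checkout.completed", 1, "c"),      # counter
    format_metric("checkout.duration", 320, "ms"),    # timer
    format_metric("upload.progress", 75, "g"),        # gauge
    format_metric("visitors.unique", "user42", "s"),  # set
]
```

Calling `send_metric(EXAMPLES[0])` would push a single counter increment to the local daemon. Because it is UDP, the application does not wait for an answer.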

How does it integrate with my app

Push metrics from your code
clients are available for many languages e.g ntest statsd

Which backends can it integrate with?

Graphite
MySQL, InfluxDB, MongoDB

Prometheus

A monitoring tool that actively scrapes data, stores it, and supports queries, graphs, and alerts. It also provides endpoints for other API consumers such as Grafana or even Graphite itself.

Why it exists

Created to monitor highly dynamic container environments like Kubernetes and Docker Swarm.
However, it can also be used in a traditional non-container infrastructure.
It has become the mainstream monitoring tool of choice in the container and microservice world.

Where can I use it?

When running multiple servers that host containerized applications, with many processes running on that infrastructure and everything interconnected.
In complex infrastructure with lots of distributed servers.

What can go wrong in complex infrastructure?

One service can crash and cause others to fail. This can be difficult to debug manually.
Application downtime
Errors
Overloaded and running out of resources
Response latency

What is the remedy for problems in complex infrastructure

Have a tool that constantly monitors all the services.
Alert the maintainers as soon as one service crashes, so you know what happened.
Identify problems before they occur and alert the system admin responsible for that infrastructure so the issue can be prevented, e.g.
- when the application is about to run out of storage space
- when an application becomes too slow; have a tool that detects network spikes

What it offers

Automated monitoring and alerting for devops workflow

Architecture/How it works

Prometheus Server

Does the actual monitoring work and is made up of three components:
1. Storage: a time series database --> stores all the metrics data, e.g. current CPU usage, number of requests
2. Data retrieval worker --> responsible for pulling those metrics from the targets and pushing them into the database
3. Server API --> accepts queries for that stored data and is used to display/visualize the data inside Prometheus or another graphing tool

What does the Prometheus Alertmanager do?

Responsible for sending alerts via different channels, e.g. email, Slack.
Determines who receives an alert and how it is delivered.
Alerting rules are evaluated by the Prometheus server; if a rule's condition is met, an alert is fired and handed to the Alertmanager.

Where is data stored

Stores the data on local disk and can also integrate with remote storage.

What are the characteristics

Designed to be reliable even when other services have an outage.
Standalone and self-contained --> does not depend on network storage

What it Monitors

A particular thing, e.g. a Linux server, a Windows server, a database server
Things it monitors are called targets
- For a Linux server it can be CPU status, memory, disk space usage
- For an application server it can be exception count, number of requests and request duration

There are 3 primitive metric types

Counter --> how many times x happened
Gauge --> what is the current value of x, e.g. how much disk space is free right now
Histogram/Timer --> how long or how big

How it collects metrics

It pulls metrics over HTTP from a target whose host address exposes a /metrics endpoint. For that to work:
- The target must expose the /metrics endpoint
- Data available at /metrics must be in a format that Prometheus understands
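
To give an idea of what that format looks like, here is a small Python sketch that parses a /metrics response in the Prometheus text format. The sample metric names and values are made up, and the parser is deliberately simplified (it skips comment lines and ignores optional timestamps), so treat it as an illustration rather than how Prometheus itself parses a scrape.

```python
SAMPLE_RESPONSE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get"} 1027
http_requests_total{method="post"} 3
# TYPE process_open_fds gauge
process_open_fds 12
"""


def parse_metrics(text):
    """Return {series: value}, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field in this sample;
        # optional timestamps are not handled in this sketch.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples
```

Each non-comment line is one sample: a metric name, optional labels in curly braces, and the current value.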

Target Endpoints and exporters

Some servers already expose the /metrics endpoint, so no extra work is needed to get metrics from them.
Many services don't have a native Prometheus endpoint and need an extra component known as an exporter.

An exporter is a script or service that fetches metrics from your target, converts them into a format Prometheus understands, and exposes them at its own /metrics endpoint.
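
A toy exporter could look roughly like the Python sketch below, using only the standard library. Everything specific in it is an assumption for illustration: `fetch_target_stats` stands in for whatever native statistics your target offers, and the `myservice` prefix and port 9200 are arbitrary choices. Real exporters are usually built on the official client libraries.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def fetch_target_stats():
    # Stand-in for asking the real target (a database, a queue, ...)
    # for its native statistics; these values are invented.
    return {"connections": 42, "queries_total": 1234}


def to_prometheus_text(stats, prefix="myservice"):
    """Convert native stats into Prometheus text-format lines."""
    lines = [f"{prefix}_{name} {value}" for name, value in sorted(stats.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Fetch fresh stats on every scrape and convert them.
        body = to_prometheus_text(fetch_target_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    # Blocks forever; port 9200 is an arbitrary choice for this sketch.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running `serve()` would make the converted metrics available at `/metrics` on that port, ready to be scraped.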

How to monitor a Linux/Windows server

Download Node Exporter (for Windows hosts, the separate windows_exporter project plays the same role)
untar and execute it
it converts the metrics of the server
it exposes a /metrics endpoint
configure Prometheus to scrape this endpoint

How to monitor an application

At application level you can monitor the following:

How long requests are taking
How many requests are being made

Use client libraries to expose the /metrics endpoint
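
As a rough idea of what such instrumentation does, here is a Python sketch that counts requests and measures how long they take. It uses plain dictionaries as a stand-in for a real client library, and all names (`instrumented`, `REQUEST_COUNT`, `REQUEST_SECONDS`, `list_orders`) are hypothetical.

```python
import time
from functools import wraps

# In-process metric storage; a client library would manage this and
# expose it at /metrics. All names here are hypothetical.
REQUEST_COUNT = {}
REQUEST_SECONDS = {}


def instrumented(handler):
    """Record how many requests a handler served and how long they took."""

    @wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            # Runs on success and on exceptions, so failures count too.
            name = handler.__name__
            elapsed = time.perf_counter() - start
            REQUEST_COUNT[name] = REQUEST_COUNT.get(name, 0) + 1
            REQUEST_SECONDS[name] = REQUEST_SECONDS.get(name, 0.0) + elapsed

    return wrapper


@instrumented
def list_orders():
    return ["order-1", "order-2"]
```

After a few calls to `list_orders()`, the two dictionaries hold the request count and the total time spent, which is the raw material for a counter and a timer.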

Why care about pull vs push

In a push mechanism, applications/servers push to a centralized platform. The drawbacks are:
- high load of network traffic: when you work with many microservices and each service pushes its metrics to the monitoring system, it creates a high load of traffic in your infrastructure and your monitoring can become your bottleneck
- you must install a daemon on each target to push data to the monitoring server

With pull, you get better insight into whether a service is up and running.

note: for short-lived jobs Prometheus offers a Pushgateway

How does it know what to scrape and when?

All of that is configured in prometheus.yml:
- define what targets to scrape and at what interval.
- It uses service discovery to discover those endpoints.
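
The "what and when" part can be pictured with a small Python sketch of a scrape scheduler. This is not how Prometheus is implemented, and the target URLs and intervals are made up; it only shows the idea of a list of targets, each with its own scrape interval.

```python
from dataclasses import dataclass


@dataclass
class ScrapeTarget:
    url: str
    interval: float                    # seconds between scrapes
    last_scrape: float = float("-inf")  # never scraped yet


def due_targets(targets, now):
    """Return the targets whose scrape interval has elapsed."""
    return [t for t in targets if now - t.last_scrape >= t.interval]


# Hypothetical targets; in Prometheus these come from prometheus.yml
# or from service discovery.
TARGETS = [
    ScrapeTarget("http://app-1:8000/metrics", interval=15),
    ScrapeTarget("http://db-exporter:9200/metrics", interval=60),
]
```

A loop would call `due_targets` with the current time, scrape each returned target, and update its `last_scrape`.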

What it does not do well

Difficult to scale: when you have hundreds of servers you might want multiple Prometheus servers that aggregate all this metrics data, and configuring that can be difficult.

Workarounds

Increase prometheus server capacity.
Limit number of metrics.

Example

Grafana

Allows you to query, visualize, alert on and understand your metrics no matter where they are stored.

Enables you to create, explore and share dashboards with your team.

Example:
Explore metrics

Ad-hoc queries are queries that are made interactively, with the purpose of exploring data. An ad-hoc query is commonly followed by another, more specific query

1. Enter the following in the Query editor:

tns_request_duration_seconds_count

This is a counter, a type of metric whose value only ever increases. Rather than visualizing the actual value, you can use counters to calculate the rate of change, i.e. how fast the value increases.

2. Add the rate function to your query to visualize the rate of requests per second. Enter the following in the Query editor and then press Shift + Enter.

rate(tns_request_duration_seconds_count[5m])

Immediately below the graph there’s an area where each time series is listed with a colored icon next to it. This area is called the legend.

PromQL lets you group the time series by their labels, using the sum function.

Add the sum function to your query to group time series by route:

sum(rate(tns_request_duration_seconds_count[5m])) by(route)
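
To see what rate and sum ... by are doing conceptually, here is a simplified Python sketch. It is not Prometheus's actual algorithm: it accounts for counter resets but does not extrapolate to the edges of the window the way the real rate function does. The function names and sample data layout are assumptions for this example.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase over a window of (timestamp, value) samples.

    Simplified: handles counter resets but does not extrapolate to the
    window edges the way Prometheus does.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        increase += (curr - prev) if curr >= prev else curr
    return increase / (samples[-1][0] - samples[0][0])


def sum_by(rates, label):
    """Group per-series rates by one label and add them up."""
    totals = defaultdict(float)
    for labels, value in rates:
        totals[labels[label]] += value
    return dict(totals)
```

Given the per-series rates, `sum_by(rates, "route")` collapses all series that share the same route label into one value per route, which is what the grouped query shows as one line per route.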

Add a logging data source

Grafana supports log data sources, like Loki. Just like for metrics, you first need to add your data source to Grafana.

Loki

Grafana Loki is a set of components that can be composed into a fully featured logging stack.

Explore logs

{filename="/var/log/tns-app.log"}

Grafana displays all logs within the log file of the sample application. The height of each bar encodes the number of logs that were generated at that time.

{filename="/var/log/tns-app.log"} |= "error"
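
The two queries above first select a log stream by its label and then keep only the lines that contain a given string. A rough Python equivalent, with invented log lines and a hypothetical `query` helper (this is not Loki's implementation), looks like this:

```python
# Hypothetical log streams keyed by their "filename" label.
STREAMS = {
    "/var/log/tns-app.log": [
        'level=info msg="request served"',
        'level=error msg="database timeout"',
    ],
    "/var/log/other.log": ['level=error msg="unrelated"'],
}


def query(streams, filename, contains=None):
    """Select a stream by label, then optionally filter its lines."""
    lines = streams.get(filename, [])
    if contains is None:
        return list(lines)
    # Like the |= operator: keep lines containing the exact substring.
    return [line for line in lines if contains in line]
```

Here `query(STREAMS, "/var/log/tns-app.log", contains="error")` keeps only the error line from that one file and ignores the other stream.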

Logs are helpful for understanding what went wrong. You can correlate logs with metrics from Prometheus to better understand the context of the error.

Build a Dashboard

A dashboard gives you an at-a-glance view of your data and lets you track metrics through different visualizations.

Dashboards consist of panels, each representing a part of the story you want your dashboard to tell.

Every panel consists of a query and a visualization. The query defines what data you want to display, whereas the visualization defines how the data is displayed.
Unlike other logging systems, Loki is built around the idea of only indexing metadata about your logs: labels (just like Prometheus labels).

Annotate events

When things go bad, it often helps if you understand the context in which the failure occurred. Time of last deploy, system changes, or database migration can offer insight into what might have caused an outage. Annotations allow you to represent such events directly on your graphs.

query annotations
{filename="/var/log/tns-app.log"} |= "error"

Alerts

Alerts allow you to identify problems in your system moments after they occur. By quickly identifying unintended changes in your system, you can minimize disruptions to your services.

Alerts consist of two parts:

Notification channel - How the alert is delivered. When the conditions of an alert rule are met, Grafana notifies the channels configured for that alert.
Alert rules - When the alert is triggered. Alert rules are defined by one or more conditions that are regularly evaluated by Grafana.
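
The split between "when" (the rule) and "how" (the channel) can be sketched in a few lines of Python. This is a conceptual illustration only, not Grafana's API: `AlertRule`, `evaluate`, the threshold of 5 errors per second and the list standing in for a channel are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    name: str
    condition: Callable[[float], bool]  # when the alert triggers
    notify: Callable[[str], None]       # how the alert is delivered


def evaluate(rule, current_value):
    """Check one rule against the latest value; notify if it fires."""
    if rule.condition(current_value):
        rule.notify(f"[ALERT] {rule.name}: value is {current_value}")
        return True
    return False


sent = []  # stand-in for an email or Slack channel
high_error_rate = AlertRule(
    name="High error rate",
    condition=lambda errors_per_second: errors_per_second > 5,
    notify=sent.append,
)
```

A scheduler would call `evaluate` at a regular interval with the latest query result; swapping `sent.append` for a function that posts to Slack changes the channel without touching the rule.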

Supported Alert Notification Channels
Email
Slack
Kafka
Google Hangouts Chat
Microsoft Teams
etc.

Others to look at

TICK stack
Google Cloud Monitoring