What is Monitoring?
The periodic tracking (for example, daily, weekly, monthly, quarterly, or annually) of an activity's progress by systematically gathering and analyzing data and information.
Why monitor?
Helps to answer the following questions:
- What is happening right now?
- What has changed, and when? (Has it gone up or down? Did it change when we did this deploy?)
- When did it happen?
- How often does it happen? E.g. how many times a second or a day; when do people use this feature; how many queries am I sending to the database?
- How long did it take?
- Enhance reliability: software bugs can be prevented via:
- testing
- measuring
- monitoring
- analyzing.
What is a metric?
- A measurement collected over time for a specific target.
Things to monitor?
- request latency
- request count
- number of exceptions
- CPU utilization
How to Monitor?
We could build this ourselves with traditional methods, but why reinvent the wheel? Use monitoring tools.
What do monitoring tools offer?
- Collection
  - Describes how the measurements are taken.
- Storage
  - How to get the metrics you measured from where you measured them into something that will store them long term (usually called time series databases).
  - This storage also needs to be capable of fetching that data quickly for analysis/graphing/alerting etc.
- Graphing
  - Visualizing and analyzing the metrics collected.
- Aggregation windows
  - How long you can keep metrics before you run out of space or the storage becomes slow.
- Alerting
  - Channels used to notify users when things happen.
What are some of the common measurement/metric types?
- Counter
  - Use this for situations where you want to know “how many times has x happened”.
  - For example the total number of HTTP requests, or the total number of bytes sent over HTTP.
- Gauge
  - A representation of a metric that can go both up and down.
  - Answers “what is the current value of x now”, e.g. what is the current number of logged-in users.
- Histogram
  - Represents observed metrics sharded into distinct buckets.
  - Think of this as a mechanism to track “how long something took” or “how big something was”.
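The three measurement types above can be sketched as toy Python classes. This is illustrative only, assuming no particular monitoring library; real clients (Prometheus, StatsD, etc.) provide production versions of these:

```python
import bisect

class Counter:
    """Monotonically increasing value: 'how many times has x happened'."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        self.value += amount

class Gauge:
    """A value that can go both up and down: 'what is the value of x now'."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Observations sharded into buckets: 'how long' or 'how big'."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, 5.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
        self.total = 0.0
    def observe(self, value):
        # Find the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += value

# Hypothetical metrics for an imaginary web service:
requests = Counter()
logged_in = Gauge()
latency = Histogram()

requests.inc()          # one more HTTP request served
logged_in.set(42)       # current number of logged-in users
latency.observe(0.3)    # a 300 ms request lands in the 0.5s bucket
```

Note how a counter only ever goes up, a gauge is set to whatever the current value is, and a histogram keeps per-bucket counts rather than individual observations.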
Monitoring patterns for reporting metrics
- Pull model
  - The monitoring system "scrapes" the application at a predefined HTTP endpoint.
- Push model
  - The application sends the data to the monitoring system.
An example of a monitoring system working in the pull model is Prometheus. StatsD is an example of a monitoring system where the application pushes the metrics to the system.
Monitoring tools
We will focus on the following:
- StatsD
- Prometheus
What is StatsD?
- A system for instrumenting metrics.
- StatsD is near real time and keeps little history.
- A funnel that collects data.
Why was it created?
- Ops engineers could not see the internals of an application beyond its CPU utilization and network I/O.
How it works?
- StatsD will accept measurements from all over your network via UDP.
- Aggregates them into 10-second chunks.
- Sends them off to somewhere that will store the data, e.g. Graphite.
What are the components of StatsD?
- API client --> this integrates with your actual application.
- UDP protocol --> the API client is just a wrapper over the UDP protocol.
- Daemon --> this daemon runs alongside your actual application service.
- Metrics collection: you instrument your application with metrics, which send UDP packets to a local daemon; the local daemon aggregates these packets together and sends them in batches to a backend.
What metric types are available?
- counters - how many times something happened
- timers - how long something took, e.g. on average
- gauges - e.g. % completion
- sets - keep track of unique values
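These metric types travel over StatsD's simple UDP line protocol. A stdlib-only Python sketch of that wire format (metric names are illustrative; real client libraries wrap exactly this, and 8125 is StatsD's conventional port):

```python
import socket

# StatsD wire format: "<metric.name>:<value>|<type>" sent as a UDP datagram.
# Types: c = counter, ms = timer (milliseconds), g = gauge, s = set.
def statsd_line(name, value, metric_type):
    return f"{name}:{value}|{metric_type}"

def send_metric(line, host="127.0.0.1", port=8125):
    # UDP is fire-and-forget: the app never blocks waiting on the daemon,
    # and losing an occasional packet is an accepted trade-off.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()

send_metric(statsd_line("checkout.requests", 1, "c"))    # counter
send_metric(statsd_line("checkout.latency", 320, "ms"))  # timer
send_metric(statsd_line("queue.depth", 17, "g"))         # gauge
send_metric(statsd_line("uniques", "user42", "s"))       # set
```

Because it is plain UDP, the application keeps running even if the StatsD daemon is down, which is exactly the "funnel" behavior described above.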
How does it integrate with my app?
- Push metrics from your code.
- Clients are available for many languages.
Which backends can it integrate with?
- Graphite
- MySQL, InfluxDB, MongoDB
Prometheus
A monitoring tool that actively scrapes data, stores it, and supports queries, graphs, and alerts, as well as provides endpoints to other API consumers like Grafana or even Graphite itself.
Why it exists
- Created to monitor highly dynamic container environments like Kubernetes, Docker Swarm, etc.
- However, it can also be used in traditional non-container infrastructure.
- It has become the mainstream monitoring tool of choice in the container and microservice world.
Where can I use it?
- When running multiple servers with containerized applications, where many processes run on that infrastructure and things are interconnected.
- In complex infrastructure with lots of distributed servers.
What can go wrong in complex infrastructure?
- One service can crash and cause failures in others. This can be difficult to debug manually.
- Application downtime
- Errors
- Overloaded and running out of resources
- Response latency
What is the remedy for problems in complex infrastructure?
- Have a tool that constantly monitors all the services.
- Alerts the maintainers as soon as a service crashes, so you know what happened.
- Identifies problems before they occur and alerts the system admin responsible for that infrastructure to prevent the issue, e.g.
  - when the application is about to run out of storage space
  - when an application becomes too slow; have a tool that detects network spikes
What it offers
- Automated monitoring and alerting for DevOps workflows.
Architecture/How it works
Prometheus Server
- Does the actual monitoring work and is made up of three components:
  - Storage: a time series database --> stores all the metrics data, e.g. current CPU usage, number of requests.
  - Data retrieval worker: responsible for pulling those metrics and writing them to the database.
  - Server API: accepts queries for the stored data, used to display/visualize the data inside Prometheus or another graphing tool.
What is the work of the Prometheus Alertmanager?
- Responsible for firing alerts via different channels, e.g. email, Slack.
- Determines who receives an alert and how it is delivered.
- The Prometheus server reads the alerting rules; if a rule's condition is met, an alert fires and is handed to the Alertmanager for delivery.
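Alerting rules are plain YAML files loaded by the Prometheus server. A minimal sketch, where the metric names, threshold, and labels are all illustrative:

```yaml
# rules.yml -- a sketch of a Prometheus alerting rule (names are illustrative).
groups:
  - name: example
    rules:
      - alert: HighRequestLatency
        expr: >
          rate(http_request_duration_seconds_sum[5m])
          / rate(http_request_duration_seconds_count[5m]) > 0.5
        for: 10m          # condition must hold for 10 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "Average request latency above 500ms"
```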
Where is data stored?
- Stores the data on local disk; it also integrates with remote storage.
What are the characteristics?
- Designed to be reliable even when other services have an outage.
- Standalone and self-contained --> does not depend on network storage.
What it Monitors
- A particular thing, e.g. a Linux server, a Windows server, a database server.
- Things it monitors are called targets.
- For a Linux server it can be CPU status, memory, disk space usage.
- For an application server it can be exception count, number of requests, and request duration.
There are 3 primitive metric types
- Counter --> how many times x happened
- Gauge --> what is the current value of x now, e.g. the current capacity of disk space
- Histogram/Timer --> how long or how big
How it collects metrics
- It pulls metrics over HTTP from targets whose host address exposes a /metrics endpoint.
For that to work:
- The target must expose the /metrics endpoint.
- Data available at /metrics must be in a format that Prometheus understands.
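The format Prometheus understands is a plain-text exposition format. A stdlib-only Python sketch of what a /metrics response body looks like (the metric names and values here are illustrative, not a full implementation of the format):

```python
# Each metric gets a HELP line, a TYPE line, and one or more sample lines.
def render_metrics(metrics):
    """metrics: list of (name, help_text, type, value) tuples."""
    lines = []
    for name, help_text, metric_type, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics([
    ("http_requests_total", "Total HTTP requests.", "counter", 1027),
    ("process_cpu_usage", "Current CPU usage.", "gauge", 0.42),
])
print(body)
```

An exporter's whole job is to produce a body like this so the Prometheus data retrieval worker can scrape it.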
Target Endpoints and exporters
- Some services already expose the /metrics endpoint, so no extra work is needed to get metrics from them.
- Many services don't have a native Prometheus endpoint and need an extra component known as an exporter.
- An exporter is a script or service that fetches metrics from your target, converts them into a format Prometheus understands, and exposes these metrics on its own /metrics endpoint.
How to monitor a Linux/Windows server
- Download the node exporter.
- Untar and execute it.
- It converts the server's metrics.
- And exposes them on a /metrics endpoint.
- Configure Prometheus to scrape this endpoint.
How to monitor an application
At the application level you can monitor the following:
- How long requests are taking
- How many requests are coming in
Use client libraries to expose a /metrics endpoint from your application.
Why care?
- In the push mechanism, applications/servers push to a centralized platform.
- High network load: when working with many microservices and each service pushes its metrics to the monitoring system, it creates a high load of traffic in your infrastructure, and your monitoring can become your bottleneck.
- You must install a daemon on each target to push data to the monitoring server.
- Pulling gives better detection/insight into whether a service is up and running.
Note: for short-lived jobs, Prometheus offers a Pushgateway.
How does it know what to scrape and when?
- All that is configured in prometheus.yml:
  - Define what targets to scrape and at what interval.
  - It uses service discovery to discover those endpoints.
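A minimal prometheus.yml sketch. The job names, ports, and intervals are illustrative (9100 is the node exporter's conventional port):

```yaml
# prometheus.yml -- a minimal sketch (targets and intervals are illustrative).
global:
  scrape_interval: 15s          # how often to scrape targets by default
scrape_configs:
  - job_name: "node"            # the node exporter on a Linux server
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "my-app"          # an application exposing /metrics itself
    scrape_interval: 5s         # per-job override of the global interval
    static_configs:
      - targets: ["localhost:8080"]
```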
What does it not do well?
- Difficult to scale: when you have hundreds of servers, you might want multiple Prometheus servers that somehow aggregate all this metrics data, and configuring that can be difficult.
Workarounds
- Increase prometheus server capacity.
- Limit number of metrics.
Example
Grafana
Allows you to query, visualize, alert on, and understand your metrics no matter where they are stored.
- Enables you to create, explore and share dashboards with your team.
Example:
Explore metrics
- Ad-hoc queries are queries that are made interactively, with the purpose of exploring data. An ad-hoc query is commonly followed by another, more specific query
tns_request_duration_seconds_count
- is a counter, a type of metric whose value only ever increases. Rather than visualizing the actual value, you can use counters to calculate the rate of change, i.e. how fast the value increases.
- Add the rate function to your query to visualize the rate of requests per second. Enter the following in the Query editor and then press Shift + Enter.
rate(tns_request_duration_seconds_count[5m])
Immediately below the graph there’s an area where each time series is listed with a colored icon next to it. This area is called the legend.
PromQL lets you group the time series by their labels, using the sum function.
Add the sum function to your query to group time series by route:
sum(rate(tns_request_duration_seconds_count[5m])) by(route)
Add a logging data source
Grafana supports log data sources, like Loki. Just like for metrics, you first need to add your data source to Grafana.
Loki
Grafana Loki is a set of components that can be composed into a fully featured logging stack.
Explore logs
- {filename="/var/log/tns-app.log"}
Grafana displays all logs within the log file of the sample application. The height of each bar encodes the number of logs that were generated at that time.
{filename="/var/log/tns-app.log"} |= "error"
Logs are helpful for understanding what went wrong
- you can correlate logs with metrics from Prometheus to better understand the context of the error.
Build a Dashboard
A dashboard gives you an at-a-glance view of your data and lets you track metrics through different visualizations.
Dashboards consist of panels, each representing a part of the story you want your dashboard to tell.
Every panel consists of a query and a visualization. The query defines what data you want to display, whereas the visualization defines how the data is displayed.
Unlike other logging systems, Loki is built around the idea of only indexing metadata about your logs: labels (just like Prometheus labels).
Annotate events
When things go bad, it often helps if you understand the context in which the failure occurred. Time of last deploy, system changes, or database migration can offer insight into what might have caused an outage. Annotations allow you to represent such events directly on your graphs.
Query annotations:
{filename="/var/log/tns-app.log"} |= "error"
Alerts
Alerts allow you to identify problems in your system moments after they occur. By quickly identifying unintended changes in your system, you can minimize disruptions to your services.
Alerts consist of two parts:
Notification channel - How the alert is delivered. When the conditions of an alert rule are met, Grafana notifies the channels configured for that alert.
Alert rules - When the alert is triggered. Alert rules are defined by one or more conditions that are regularly evaluated by Grafana.
Supported Alert Notification Channels
Email
Slack
Kafka
Google Hangouts Chat
Microsoft Teams
etc.
Others to look at
- TICK stack
- Google Cloud Monitoring