Site Reliability Engineering (SRE) and related concepts are very popular lately, in part due to the famous Google SRE book and others talking about the “Golden Signals” that you should be monitoring to keep your systems fast and reliable as they scale.
Everyone seems to agree these signals are important, but how do you actually monitor them? No one seems to talk much about this.
These signals are much harder to get than traditional CPU or RAM monitoring, as each service and resource has different metrics, definitions, and especially tools required.
Microservices, Containers, and Serverless make getting signals even more challenging, but still worthwhile, as we need these signals, both to avoid traditional alert noise and to effectively troubleshoot our increasingly complex distributed systems.
This series of articles will walk through the signals and practical methods for a number of common services. First, we’ll talk briefly about the signals themselves, then a bit about how you can use them in your monitoring system.
Finally, there is a list of service-specific guides on how to monitor the signals, for Load Balancers, Web & App Servers, DB & Cache Servers, General Linux, and more. This list and the details may evolve over time as we continually seek feedback and better methods to get better data.
There are three common lists or methodologies:
- From the Google SRE book: Latency, Traffic, Errors, and Saturation
- USE Method (from Brendan Gregg): Utilization, Saturation, and Errors
- RED Method (from Tom Wilkie): Rate, Errors, and Duration
You can see the overlap, and as Baron Schwartz notes in his Monitoring & Observability with USE and RED blog, each method varies in focus. He suggests USE is about resources with an internal view, while RED is about requests, real work, and thus an external view (from the service consumer’s point of view). They are obviously related, and also complementary, as every service consumes resources to do work.
For our purposes, we’ll focus on a simple superset of five signals:
- Rate – Request rate, in requests/sec
- Errors – Error rate, in errors/sec
- Latency – Response time, including queue/wait time, in milliseconds
- Saturation – How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
- Utilization – How busy the resource or system is. Usually expressed as 0–100% and most useful for predictions (for spotting current problems, Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
Of these, Saturation & Utilization are often the hardest to get, and full of assumptions, caveats and calculation complexities, so treat them as approximations, at best. However, they are often the most valuable for hunting down current and future problems, so we learn to love them.
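For reference, the Utilization Law mentioned in the list above can be sketched in a few lines. This is only an illustration of the formula; the function name and the example numbers are assumptions, not measurements from any real system.

```python
def utilization(rate_per_sec: float, service_time_sec: float, workers: int) -> float:
    """Approximate utilization via the Utilization Law.

    rate_per_sec:      completed requests per second
    service_time_sec:  average time one worker spends on one request
    workers:           number of parallel workers (threads, processes, cores)
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    return rate_per_sec * service_time_sec / workers


# Hypothetical example: 200 req/s, 20 ms of service time each, 8 workers.
u = utilization(200, 0.020, 8)
print(f"{u:.0%}")  # 50%
```

A result above 1.0 means the assumed workers cannot keep up with the assumed load, which is roughly where Saturation (queueing) would start to show up.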
All these measurements can be split and/or aggregated by various things. For example, HTTP errors could be split out into 4xx & 5xx, just as Latency or Rate could be broken out by URL.
In addition, there are more sophisticated ways to calculate things. For example, errors often have lower latency than successful requests so you could exclude errors from Latency, if you can (often you cannot).
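As a small illustration of both ideas, the sketch below splits requests by status class and compares latency with and without errors. The request list is made-up sample data, and it assumes you have per-request status and latency available, which is often not the case.

```python
from collections import Counter
from statistics import mean

# Hypothetical request log: (HTTP status, latency in milliseconds).
requests = [
    (200, 120), (200, 95), (404, 12), (200, 110),
    (500, 8), (503, 5), (200, 130), (301, 40),
]


def status_class(status: int) -> str:
    """Map 404 -> '4xx', 503 -> '5xx', and so on."""
    return f"{status // 100}xx"


# Split the request count by status class.
by_class = Counter(status_class(status) for status, _ in requests)

# Latency over all requests vs. latency over non-error requests only.
all_latency = mean(ms for _, ms in requests)
ok_latency = mean(ms for status, ms in requests if status < 400)

print(dict(by_class))
print(all_latency, ok_latency)  # 65 vs. 99: the fast errors hide slow successes
```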
As useful as these splits or aggregates are, they are outside the scope of this article as they get much closer to metrics, events, high-cardinality analysis, etc. Let’s focus on getting the basic data first, as that’s hard enough.
You can skip ahead to the actual data collection guides at the end of this post if you’d like, but we should talk about what to do with these signals once we have them.
One of the key reasons these are “Golden” Signals is that they try to measure things that directly affect the end-user and work-producing parts of the system; they are direct measurements of things that matter.
This means, in theory, they are better and more useful than lots of less-direct measurements such as CPU, RAM, networks, replication lag, and endless other things (we should know, as we often monitor 200+ items per server; never a good time).
We collect the Golden Signals for a few reasons:
- Alerting – Tell us when something is wrong
- Troubleshooting – Help us find & fix the problem
- Tuning & Capacity Planning – Help us make things better over time
Our focus here is on Alerting, and how to alert on these Signals. What you do after that is between you and your signals.
Alerting has traditionally used static thresholds, in our beloved (ha!) Nagios, Zabbix, DataDog, etc. systems. That works, but thresholds are hard to set well and generate lots of alert noise, as you (and anyone you are living with) are most likely acutely aware.
But start with static if you must, based on your experience and best practices. These often work best when set to levels where we’re pretty sure something is wrong, or at least something unusual is going on (e.g. 95% CPU, latency over 10 seconds, modest size queues, error rates above a few per second, etc.).
If you use static alerting, don’t forget the lower bound alerts, such as near zero requests per second or latency, as these often mean something is wrong, even at 3 a.m. when traffic is light.
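A minimal sketch of static thresholds with both upper and lower bounds might look like the following. The signal names and limits are hypothetical, loosely based on the examples above; real values depend on your system and your monitoring tool's own configuration format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Static alert bounds for one signal; None means 'no bound'."""
    low: Optional[float] = None
    high: Optional[float] = None

    def check(self, value: float) -> Optional[str]:
        if self.low is not None and value < self.low:
            return "too low"
        if self.high is not None and value > self.high:
            return "too high"
        return None


# Hypothetical bounds; tune these to your own system.
THRESHOLDS = {
    "cpu_percent": Threshold(high=95),
    "latency_ms": Threshold(low=1, high=10_000),
    "requests_per_sec": Threshold(low=0.1),
    "errors_per_sec": Threshold(high=3),
}


def evaluate(sample: dict) -> dict:
    """Return {signal: problem} for every signal outside its bounds."""
    alerts = {}
    for name, value in sample.items():
        threshold = THRESHOLDS.get(name)
        problem = threshold.check(value) if threshold else None
        if problem:
            alerts[name] = problem
    return alerts


# A quiet 3 a.m. sample where traffic has silently dropped to zero.
print(evaluate({"cpu_percent": 4, "latency_ms": 250, "requests_per_sec": 0.0}))
```

Without the lower bound on `requests_per_sec`, this sample would look perfectly healthy.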
These alerts typically use average values, but do yourself a favor and try to use median values if you can, as these are less sensitive to big/small outlier values.
Averages have other problems, too, as Optimizely points out in their blog. Still, averages/medians are easily understood, accessible, and quite useful as a signal, as long as your measuring window is short (e.g. 1–5 minutes).
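To see how differently the two behave, here is a quick comparison on made-up latencies from a single one-minute window in which one request stalled:

```python
from statistics import mean, median

# Hypothetical latencies (ms) from a one-minute window; one request stalled.
window = [102, 98, 110, 95, 105, 99, 101, 9_500]

avg = mean(window)    # pulled far upward by the single outlier
mid = median(window)  # barely moves

print(avg, mid)  # 1276.25 101.5
```

An average-based alert at, say, 1 second would fire here even though seven of eight requests were fine; a median-based one would not.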
Even better is to start thinking about percentiles. For example, you can alert on your 95th percentile Latency, which is a much better measure of how bad things are for your users.
However, percentiles are more complex than they appear, and of course Vivid Cortex has a blog on this: “Why Percentiles Don’t Work the Way you think they do”. It warns, for example, that your system is really doing a percentile of an average over your measurement time (e.g. 1 or 5 minutes). Percentiles are still useful for alerting, though, and you should try them if you can (you’ll often be shocked how bad your percentiles are).
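The "percentile of an average" problem can be shown with a small sketch. It uses a simple nearest-rank percentile (one of several common definitions) and hypothetical data in which 10% of requests in every minute are slow; your monitoring system may compute things differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 10 minutes, each with 18 fast requests and 2 slow ones (ms).
minutes = [[100] * 18 + [3000] * 2 for _ in range(10)]

# What users experienced: the 95th percentile of the raw latencies.
raw = [ms for minute in minutes for ms in minute]
p95_raw = percentile(raw, 95)

# What you get if only per-minute averages were stored.
minute_averages = [sum(m) / len(m) for m in minutes]
p95_of_averages = percentile(minute_averages, 95)

print(p95_raw, p95_of_averages)  # 3000 390.0
```

The averaged version still moves when things get worse, which is why it remains usable for alerting, but it badly understates what the slowest users actually see.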
Ideally, you can also start using modern Anomaly Detection on your shiny new Golden Signals. Anomaly Detection is especially useful for catching problems that occur off-peak or that cause unusually low metric values. Plus, it allows much tighter alerting bands so you find issues much earlier (but not so early that you drown in false alerts).
However, Anomaly Detection can be pretty challenging, as few on-premises monitoring solutions can even do it (Zabbix cannot). It’s also fairly new, still-evolving, and hard to tune well (especially with the ‘seasonality’ and trending so common in our Golden Signals).
Fortunately, newer SaaS / Cloud monitoring solutions such as DataDog, SignalFX, etc. can do this, as can new on-premises systems like Prometheus & InfluxDB.
Regardless of your anomaly tooling, Baron Schwartz has a good book on this that you should read to better understand the various options, algorithms, and challenges: Anomaly Detection for Monitoring.
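To give a feel for the basic idea, here is a deliberately naive sketch: a rolling z-score that flags values far from the recent mean. It is not how any of the products above work, it ignores seasonality and trend entirely, and the class name, window size, and threshold are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScore:
    """Flag values far from the mean of a sliding window of recent samples."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 5:  # need some history before judging
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                anomalous = value != mu
        # Note: anomalous values are still added to the history in this sketch.
        self.history.append(value)
        return anomalous


# Hypothetical requests/sec samples, ending with a sudden drop.
detector = RollingZScore()
samples = [100, 102, 98, 101, 99, 100, 103, 97, 5]
flags = [detector.is_anomaly(v) for v in samples]
print(flags[-1])  # True: the drop to 5 req/s is flagged
```

Note that the drop is caught without anyone choosing a lower-bound threshold by hand, which is the main appeal over static alerting.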
In addition to alerting, you should also visualize these signals. Weave Works has a nice format with two graph columns: on the left, a stacked graph of Request & Error Rates, and on the right, latency graphs. Splunk also has a nice view. You could add in a 3rd mixed Saturation / Utilization graph, too.
You can also enrich your metrics with Tags/Events, such as deployments, auto-scale events, restarts, etc. And ideally, show all these metrics on a System Architecture Map like Netsil does.
As a final note on alerting, we’ve found SRE Golden Signal alerts more challenging to respond to because they are actually symptoms of an underlying problem that is rarely directly exposed by the alert.
This often means engineers must have more system knowledge and be more skilled at digging into the problem, which can easily lie in any of a dozen services or resources.
Engineers have always had to connect all the dots and dig below (or above) the alerts, even for basic high CPU or low RAM issues. But the Golden Signals are usually even more abstract, and it’s easy to have a lot of them; for example, a single high-latency problem on a low-level service can easily cause many latency and error alerts all over the system.
One problem the Golden Signals help solve is that too often we only have useful data on a few services and a front-end problem creates a long hunt for the culprit.
Collecting signals on each service helps nail down which service is the most likely cause (especially if you have dependency info), and thus where to focus.
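As a rough sketch of how dependency info can narrow the search: given the set of services currently alerting, keep only those with no alerting service anywhere beneath them. The service names and dependency map are hypothetical, and a real system would need to handle shared resources and partial failures.

```python
# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
}


def transitive_deps(service: str, graph: dict) -> set:
    """All services reachable from `service` by following dependencies."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def likely_culprits(alerting: set, graph: dict = DEPENDS_ON) -> set:
    """Alerting services that have no alerting dependency beneath them."""
    return {
        service
        for service in alerting
        if alerting.isdisjoint(transitive_deps(service, graph))
    }


# Three services are alerting, but only the lowest one is the likely cause.
print(likely_culprits({"web", "api", "db"}))  # {'db'}
```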
That’s it. Have fun with your signals, as they are both challenging and interesting to find, monitor, and alert on.
Below are the appendix articles for various popular services, and we are working to add more over time. Again, we welcome feedback on these.
Note there are lots of nuances & challenges to getting this data in a usable way, so apologies in advance for all the notes and caveats sprinkled throughout as we balance being clear with being somewhat thorough.
Note also that you have to do your own processing in some cases, such as doing delta calculations when you sample counter-based metrics (most systems will do this automatically for you).
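If you do need to compute deltas yourself, a minimal sketch looks like this. The function name is an assumption, and it treats a decreasing counter as a reset (for example after a service restart) and skips that interval rather than guessing.

```python
def rate_per_sec(prev_value, prev_time, curr_value, curr_time):
    """Per-second rate between two samples of an ever-increasing counter.

    Returns None when no rate can be computed: no elapsed time, or the
    counter went backwards, which usually means the service restarted.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        return None
    delta = curr_value - prev_value
    if delta < 0:
        return None  # counter reset: skip this interval rather than guess
    return delta / elapsed


# Hypothetical samples of a "total requests" counter taken 60 seconds apart.
print(rate_per_sec(1000, 0, 1600, 60))  # 10.0 requests/sec
print(rate_per_sec(1600, 60, 40, 120))  # None (counter was reset)
```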
On the list:
- Load Balancers – AWS ALB/ELB, HAProxy
- Web Servers – Apache & Nginx
- App Servers – PHP, FPM, Java, Ruby, Node, Go, Python
- Database Servers – MySQL & AWS RDS
- Linux Servers – As underlying Resources
Since this is a long and complex set of articles, there are undoubtedly different views and experiences, so we welcome feedback and other ideas. We will revise the text based on feedback and others’ experience, so check back from time to time for updates on your favorite services.
This article originally appeared on Medium.