<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Steve Mushero</title>
    <description>The latest articles on DEV Community by Steve Mushero (@stevemushero).</description>
    <link>https://dev.to/stevemushero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F17084%2Fc69d11a0-dd9d-49b1-87c5-69f6b7099923.jpg</url>
      <title>DEV Community: Steve Mushero</title>
      <link>https://dev.to/stevemushero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stevemushero"/>
    <language>en</language>
    <item>
      <title>How to Monitor the SRE Golden Signals</title>
      <dc:creator>Steve Mushero</dc:creator>
      <pubDate>Tue, 14 Nov 2017 04:30:51 +0000</pubDate>
      <link>https://dev.to/stevemushero/how-to-monitor-the-sre-golden-signals-9oc</link>
      <guid>https://dev.to/stevemushero/how-to-monitor-the-sre-golden-signals-9oc</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsq84d3u2jzpyrdy39jxs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsq84d3u2jzpyrdy39jxs.jpg" alt="Binoculars" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Site Reliability Engineering (SRE) and related concepts are very popular lately, in part due to the famous &lt;a href="https://smile.amazon.com/Site-Reliability-Engineering-Production-Systems/dp/149192912X/ref=sr_1_1" rel="noopener noreferrer"&gt;Google SRE book&lt;/a&gt; and others talking about the “Golden Signals” that you should be monitoring to keep your systems fast and reliable as they scale.&lt;/p&gt;

&lt;p&gt;Everyone seems to agree these signals are important, but how do you actually monitor them? No one seems to talk much about this.&lt;/p&gt;

&lt;p&gt;These signals are much harder to get than traditional CPU or RAM monitoring, as each service and resource has different metrics, definitions, and especially tools required.&lt;/p&gt;

&lt;p&gt;Microservices, Containers, and Serverless make getting signals even more challenging, but still worthwhile, as we need these signals – both to avoid traditional alert noise and to effectively troubleshoot our increasingly complex distributed systems.&lt;/p&gt;

&lt;p&gt;This series of articles will walk through the signals and practical methods for a number of common services. First, we’ll talk briefly about the signals themselves, then a bit about how you can use them in your monitoring system.&lt;/p&gt;

&lt;p&gt;Finally, there is a list of service-specific guides on how to monitor the signals, for Load Balancers, Web &amp;amp; App Servers, DB &amp;amp; Cache Servers, General Linux, and more. This list and the details may evolve over time as we continually seek feedback and better methods to get better data.&lt;/p&gt;

&lt;h2&gt;First, what are the SRE Signals?&lt;/h2&gt;

&lt;p&gt;There are three common lists or methodologies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the &lt;a href="https://smile.amazon.com/Site-Reliability-Engineering-Production-Systems/dp/149192912X/ref=sr_1_1" rel="noopener noreferrer"&gt;Google SRE book&lt;/a&gt;: Latency, Traffic, Errors, and Saturation&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://www.brendangregg.com/usemethod.html" rel="noopener noreferrer"&gt;USE Method&lt;/a&gt; (from Brendan Gregg): Utilization, Saturation, and Errors&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/" rel="noopener noreferrer"&gt;RED Method&lt;/a&gt; (from Tom Wilkie): Rate, Errors, and Duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see the overlap, and as Baron Schwartz notes in his Monitoring &amp;amp; &lt;a href="https://www.vividcortex.com/blog/monitoring-and-observability-with-use-and-red" rel="noopener noreferrer"&gt;Observability with USE and RED blog&lt;/a&gt;, each method varies in focus. He suggests USE is about resources with an internal view, while RED is about requests, real work, and thus an external view (from the service consumer’s point of view). They are obviously related, and also complementary, as every service consumes resources to do work.&lt;/p&gt;

&lt;p&gt;For our purposes, we’ll focus on a simple superset of five signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rate&lt;/strong&gt; – Request rate, in requests/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors&lt;/strong&gt; – Error rate, in errors/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; – Response time, including queue/wait time, in milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saturation&lt;/strong&gt; – How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, and often not much before. Usually a counter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utilization&lt;/strong&gt; – How busy the resource or system is. Usually expressed as 0–100%, and most useful for predictions (for detecting current problems, Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.&lt;/li&gt;
&lt;/ul&gt;
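&lt;p&gt;To make the five signals concrete, here’s a minimal sketch (in Python) of how each might be derived from one measurement window – the record format and all the numbers are purely illustrative:&lt;/p&gt;

```python
# Sketch: deriving the five signals from one measurement window.
# The request records and resource numbers are made up for illustration.
from statistics import median

WINDOW_SECS = 60

# One record per request in the window: (duration_ms, succeeded)
requests = [(120, True), (95, True), (2300, False), (110, True)]

rate = len(requests) / WINDOW_SECS                              # requests/sec
errors = sum(1 for _, ok in requests if not ok) / WINDOW_SECS   # errors/sec
latency_ms = median(d for d, _ in requests)                     # 115.0 here

# Saturation: most directly a queue measurement, sampled from the service;
# non-zero usually means you are already saturated.
queue_depth = 3

# Utilization: how busy the resource is, as a percentage.
busy_workers, total_workers = 7, 10
utilization_pct = 100.0 * busy_workers / total_workers          # 70.0
```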

&lt;p&gt;Of these, Saturation &amp;amp; Utilization are often the hardest to get, and full of assumptions, caveats, and calculation complexities – treat them as approximations, at best. However, they are often the most valuable for hunting down current and future problems, so we put up with them and learn to love them.&lt;/p&gt;

&lt;p&gt;All these measurements can be split and/or aggregated by various things. For example, HTTP could be split out to 4xx &amp;amp; 5xx errors, just as Latency or Rate could be broken out by URL.&lt;/p&gt;

&lt;p&gt;In addition, there are more sophisticated ways to calculate things. For example, errors often have lower latency than successful requests, so you could exclude errors from Latency if you can (often you cannot).&lt;/p&gt;

&lt;p&gt;As useful as these splits or aggregates are, they are outside the scope of this article, as they get much closer to metrics, events, high-cardinality analysis, etc. Let’s focus on getting the basic data first, as that’s hard enough.&lt;/p&gt;

&lt;h3&gt;Now we have our Signals, what do we do with them?&lt;/h3&gt;

&lt;p&gt;You can skip ahead to the actual data collection guides at the end of this post if you’d like, but we should talk about what to do with these signals once we have them.&lt;/p&gt;

&lt;p&gt;One of the key reasons these are “Golden” Signals is they try to measure things that directly affect the end-user and work-producing parts of the system – they are direct measurements of things that matter.&lt;/p&gt;

&lt;p&gt;This means, in theory, they are better and more useful than lots of less-direct measurements such as CPU, RAM, networks, replication lag, and endless other things (we should know, as we often monitor 200+ items per server; never a good time).&lt;/p&gt;

&lt;p&gt;We collect the Golden Signals for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alerting – Tell us when something is wrong&lt;/li&gt;
&lt;li&gt;Troubleshooting – Help us find &amp;amp; fix the problem&lt;/li&gt;
&lt;li&gt;Tuning &amp;amp; Capacity Planning – Help us make things better over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our focus here is on Alerting, and how to alert on these Signals. What you do after that is between you and your signals.&lt;/p&gt;

&lt;p&gt;Alerting has traditionally used static thresholds, in our beloved (ha!) Nagios, Zabbix, DataDog, etc. systems. That works, but thresholds are hard to set well and generate lots of alert noise, as you (and anyone you are living with) are most likely acutely aware.&lt;/p&gt;

&lt;p&gt;But start with static thresholds if you must, based on your experience and best practices. These often work best when set to levels where we’re pretty sure something is wrong, or at least something unusual is going on (e.g. 95% CPU, latency over 10 seconds, modest-sized queues, error rates above a few per second, etc.).&lt;/p&gt;

&lt;p&gt;If you use static alerting, don’t forget the lower bound alerts, such as near zero requests per second or latency, as these often mean something is wrong, even at 3 a.m. when traffic is light.&lt;/p&gt;
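&lt;p&gt;As a sketch of that idea, here’s a hypothetical threshold table with both upper and lower bounds – the signal names and limits are examples, not recommendations:&lt;/p&gt;

```python
# Sketch: static alert thresholds with lower AND upper bounds.
# Signal names and limits are illustrative, not recommendations.
THRESHOLDS = {
    # signal: (lower bound, upper bound); None means no bound on that side
    "rate_rps":   (0.1, None),     # near-zero traffic is itself an alert
    "errors_eps": (None, 3.0),     # above a few errors/sec
    "latency_ms": (1.0, 10000.0),  # near-zero latency is suspicious too
    "util_pct":   (None, 95.0),
}

def check(signal, value):
    """Return an alert message, or None if the value is within bounds."""
    low, high = THRESHOLDS[signal]
    if low is not None and low > value:
        return f"{signal} too low: {value}"
    if high is not None and value > high:
        return f"{signal} too high: {value}"
    return None
```

With this shape, a near-zero request rate at 3 a.m. still fires, where a high-bound-only rule would stay quiet.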

&lt;h4&gt;Are You Average or Percentile?&lt;/h4&gt;

&lt;p&gt;These alerts typically use average values, but do yourself a favor and try to use median values if you can, as these are less sensitive to big/small outlier values.&lt;/p&gt;

&lt;p&gt;Averages have other problems, too, as Optimizely points out in their blog. Still, averages/medians are easily understood, accessible, and quite useful as a signal, as long as your measuring window is short (e.g. 1–5 minutes).&lt;/p&gt;

&lt;p&gt;Even better is to start thinking about percentiles. For example, you can alert on your 95th percentile Latency, which is a much better measure of how bad things are for your users.&lt;/p&gt;

&lt;p&gt;However, percentiles are more complex than they appear, and of course VividCortex has a blog on this, Why Percentiles Don’t Work the Way You Think They Do, which warns, for example, that your system is really computing a percentile of an average over your measurement window (e.g. 1 or 5 minutes). But percentiles are still useful for alerting and you should try them if you can (and you’ll often be shocked how bad your percentiles are).&lt;/p&gt;
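&lt;p&gt;A quick illustration of why this matters – with a handful of made-up latency values, the median barely moves while the 95th percentile exposes the slow tail (a simple nearest-rank percentile, purely for illustration):&lt;/p&gt;

```python
# Sketch: nearest-rank p95 versus the median and mean over one window.
# The latency values are made up for illustration.
import math
from statistics import mean, median

def percentile(values, pct):
    """Nearest-rank percentile; fine for alerting on small windows."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered))   # 1-based rank
    return ordered[min(rank, len(ordered)) - 1]

latencies_ms = [40, 45, 48, 50, 52, 55, 60, 90, 400, 2500]

# A couple of slow outliers barely move the median but blow up p95 --
# exactly the user pain an average- or median-based alert would miss.
print("median:", median(latencies_ms))           # 53.5
print("mean:  ", mean(latencies_ms))             # 334
print("p95:   ", percentile(latencies_ms, 95))   # 2500
```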

&lt;h4&gt;Are you an Anomaly, or just weird?&lt;/h4&gt;

&lt;p&gt;Ideally, you can also start using modern Anomaly Detection on your shiny new Golden Signals. Anomaly Detection is especially useful to catch problems that occur off-peak or that cause unusually lower metric values. Plus it allows much tighter alerting bands, so you find issues much earlier (but not so early that you drown in false alerts).&lt;/p&gt;

&lt;p&gt;However, Anomaly Detection can be pretty challenging, as few on-premises monitoring solutions can even do it (Zabbix cannot). It’s also fairly new, still evolving, and hard to tune well (especially with the ‘seasonality’ and trending so common in our Golden Signals).&lt;/p&gt;

&lt;p&gt;Fortunately, newer SaaS / Cloud monitoring solutions such as DataDog, SignalFX, etc. can do this, as can new on-premises systems like Prometheus &amp;amp; InfluxDB.&lt;/p&gt;

&lt;p&gt;Regardless of your anomaly tooling, Baron Schwartz has a good book on this that you should read to better understand the various options, algorithms, and challenges: Anomaly Detection for Monitoring.&lt;/p&gt;

&lt;h4&gt;Can I see you?&lt;/h4&gt;

&lt;p&gt;In addition to alerting, you should also visualize these signals. Weave Works has a nice format with two graph columns (and Splunk has a similar view): on the left, a stacked graph of Request &amp;amp; Error Rates, and on the right, latency graphs. You could add a third, mixed Saturation/Utilization graph, too.&lt;/p&gt;

&lt;p&gt;You can also enrich your metrics with Tags/Events, such as deployments, auto-scale events, restarts, etc. And ideally, show all these metrics on a System Architecture Map like Netsil does.&lt;/p&gt;

&lt;h4&gt;Fix me, fix you&lt;/h4&gt;

&lt;p&gt;As a final note on alerting, we’ve found SRE Golden Signal alerts more challenging to respond to because they are actually symptoms of an underlying problem that is rarely directly exposed by the alert.&lt;/p&gt;

&lt;p&gt;This often means engineers must have more system knowledge and be more skilled at digging into the problem, which can easily lie in any of a dozen services or resources.&lt;/p&gt;

&lt;p&gt;Engineers have always had to connect all the dots and dig below (or above) the alerts, even for basic high-CPU or low-RAM issues. But the Golden Signals are usually even more abstract, and it’s easy to have a lot of them, e.g. a single high-latency problem on a low-level service can easily cause many latency and error alerts all over the system.&lt;/p&gt;

&lt;p&gt;One problem the Golden Signals help solve is that too often we have useful data on only a few services, so a front-end problem creates a long hunt for the culprit.&lt;/p&gt;

&lt;p&gt;Collecting signals on each service helps nail down which service is the most likely cause (especially if you have dependency info), and thus where to focus.&lt;/p&gt;

&lt;p&gt;That’s it. Have fun with your signals, as they are both challenging and interesting to find, monitor, and alert on.&lt;/p&gt;

&lt;h3&gt;Getting the Data from each Service&lt;/h3&gt;

&lt;p&gt;Below are the appendix articles for various popular services, and we are working to add more over time. Again, we welcome feedback on these.&lt;/p&gt;

&lt;p&gt;Note there are lots of nuances &amp;amp; challenges to getting this data in a usable way, so apologies in advance for all the notes and caveats sprinkled throughout as we balance being clear with being somewhat thorough.&lt;/p&gt;

&lt;p&gt;Note also that you have to do your own processing in some cases, such as doing delta calculations when you sample counter-based metrics (most systems will do this automatically for you).&lt;/p&gt;
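&lt;p&gt;If your system doesn’t do this for you, the delta calculation is simple – a sketch, with a guard for the counter resets a service restart causes:&lt;/p&gt;

```python
# Sketch: per-second rate from two samples of a monotonic counter,
# with a guard for counter resets (e.g. the service restarted).
def counter_rate(prev_value, prev_ts, cur_value, cur_ts):
    """Delta of a counter divided by elapsed seconds; None if not computable."""
    elapsed = cur_ts - prev_ts
    if 0 >= elapsed:
        return None                   # duplicate sample or clock skew
    if prev_value > cur_value:
        return cur_value / elapsed    # counter reset; best available guess
    return (cur_value - prev_value) / elapsed

# Two samples of a "requests served" counter, taken 60 seconds apart:
rate = counter_rate(10_000, 0, 10_600, 60)   # 10.0 requests/sec
```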

&lt;p&gt;On the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://medium.com/p/de8a98c74020" rel="noopener noreferrer"&gt;&lt;strong&gt;Load Balancers&lt;/strong&gt;&lt;/a&gt; – AWS ALB/ELB, HAProxy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@steve.mushero/web-servers-sre-golden-signals-9463c96a3db3" rel="noopener noreferrer"&gt;&lt;strong&gt;Web Servers&lt;/strong&gt;&lt;/a&gt; – Apache &amp;amp; Nginx&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@steve.mushero/app-servers-sre-golden-signals-e13e9330022f" rel="noopener noreferrer"&gt;&lt;strong&gt;App Servers&lt;/strong&gt;&lt;/a&gt; – PHP, FPM, Java, Ruby, Node, Go, Python&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@steve.mushero/mysqls-sre-golden-signals-67e2adf88824" rel="noopener noreferrer"&gt;&lt;strong&gt;Database Servers&lt;/strong&gt;&lt;/a&gt; – MySQL &amp;amp; AWS RDS&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@steve.mushero/linuxs-sre-golden-signals-af5aaa26ebae" rel="noopener noreferrer"&gt;&lt;strong&gt;Linux Servers&lt;/strong&gt;&lt;/a&gt; – As underlying Resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this is a long and complex set of articles, there are undoubtedly different views and experiences, thus we welcome feedback and other ideas. We will revise the text based on feedback and others’ experience, so check back from time to time for updates on your favorite services.&lt;/p&gt;

&lt;p&gt;This article originally &lt;a href="https://medium.com/devopslinks/how-to-monitor-the-sre-golden-signals-1391cadc7524" rel="noopener noreferrer"&gt;appeared on Medium&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>sre</category>
      <category>metrics</category>
    </item>
    <item>
      <title>Observability vs. Monitoring, is it about Active vs. Passive or Dev vs. Ops?</title>
      <dc:creator>Steve Mushero</dc:creator>
      <pubDate>Wed, 13 Sep 2017 14:16:35 +0000</pubDate>
      <link>https://dev.to/stevemushero/observability-vs-monitoring-is-it-about-active-vs-passive-or-dev-vs-ops-</link>
      <guid>https://dev.to/stevemushero/observability-vs-monitoring-is-it-about-active-vs-passive-or-dev-vs-ops-</guid>
      <description>&lt;p&gt;There is lots of chatter these days on Observability, Events, Metrics, Monitoring, and the like, including a great post by Cindy Sridharan on Monitoring &amp;amp; Observability.&lt;/p&gt;

&lt;p&gt;She tackles a range of history and differences between the two, including more formal definitions, but I’d like to look at it another way, partly from the perspective of the system being monitored.&lt;/p&gt;

&lt;p&gt;Thinking directionally, Monitoring is the passive collection of Metrics, logs, etc. about a system, while Observability is the active dissemination of information from the system. Looking at it another way, from the external ‘supervisor’ perspective, I monitor you, but you make yourself Observable.&lt;/p&gt;

&lt;p&gt;For Monitoring, it doesn’t matter if metrics are pulled or pushed, as that’s a communications strategy, but they are almost entirely passively-collected properties or conditions about the system, its resources, state, performance, errors, latency, etc. Loved and tended to by Ops teams and the most modern Developers. All of the basic SRE/USE signals fit into this category.&lt;/p&gt;

&lt;p&gt;For Observability, the system, code, developers, etc. are taking steps to make information available, making the system more observable. This often starts with increasingly rich and structured logs, plus events or markers, JMX data points, and Etsy-style emitted metrics. Loved and tended to by Developers and the most modern Ops.&lt;/p&gt;

&lt;p&gt;Monitoring is most often used for alerting, troubleshooting, capacity planning, and other traditional IT Ops functions, usually not too deeply.&lt;/p&gt;

&lt;p&gt;Observability elements, on the other hand, are often much more detailed, more diverse, and used more for debugging, complex troubleshooting, performance analyses, and generally going ‘deeper’.&lt;/p&gt;

&lt;p&gt;Perhaps Intent also matters, in that Observing a system could mean enhanced Monitoring of behaviors (often via metrics), e.g. how does it behave under this or that condition, with those inputs, or if I twiddle these knobs? Maybe this is a third category of thing.&lt;/p&gt;

&lt;p&gt;Now of course things are not really this simple, as there is overlap, especially as things get higher up the stack. CPU resource utilization is pretty firmly in the Monitoring camp, but what about App Performance, if we even know what that is?&lt;/p&gt;

&lt;p&gt;Or, are you emitting enough info, perhaps via monitored metrics, so Developers or Ops have enough information to know what’s wrong, maybe even how to fix it? The boundaries are a bit elusive, lying along a continuum.&lt;/p&gt;

&lt;p&gt;Cindy’s post talks about context, which is also increasingly important. Monitoring and Observability both need and mutually enhance it, including providing a much richer set of contextual observations before and during outages, slow downs, and random problems of the day.&lt;/p&gt;

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;As things get more complex, with more moving parts, and especially more distributed, we need more Observability. We also need more and better monitoring, at higher levels of the stack, and deeper levels of the system, at which point it might look a lot like Observability.&lt;/p&gt;

&lt;p&gt;In the end, we need them all, and what it’s called doesn’t matter much. Different teams may use and focus on different terms, directions, and intents, but it’s all in the name of making our systems faster, more reliable, and easier to build, manage, and troubleshoot.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Configuring for Log Levels &amp; Observability</title>
      <dc:creator>Steve Mushero</dc:creator>
      <pubDate>Tue, 05 Sep 2017 10:33:44 +0000</pubDate>
      <link>https://dev.to/stevemushero/configuring-for-log-levels--observability</link>
      <guid>https://dev.to/stevemushero/configuring-for-log-levels--observability</guid>
      <description>&lt;p&gt;In the old days, we debugged with console output, logs, and god-knows what other poor tools we had back in the stone age. But having to suffer through this, we learned some good lessons and had some good ideas.&lt;/p&gt;

&lt;p&gt;One dearest to my heart as a developer, debugger, and ops guy is logging. Most programs log things somewhere, usually to a single log, usually at a single log level, i.e. the log you get is all you’re going to get, and it’s probably both too much and too little.&lt;/p&gt;

&lt;p&gt;With today’s log and observability tools like Honeycomb.io, ELK, Sumo, etc. you can of course handle, search, and tag a lot more, but this is not unlimited, especially for debug levels and across networks — you simply cannot send all your debug logging at scale to Sumo Logic. You need more granularity, and ways to change it easily.&lt;/p&gt;

&lt;p&gt;The first step people took to improve this was adding log levels, following the usual hierarchy of INFO, WARNING, ERROR, DEBUG, and so on. This helps, much more so if you can change the level while the darn system is running (critical on larger, more mission-critical stuff). Store it in your Redis cache and let admins change it at run-time.&lt;/p&gt;

&lt;p&gt;Next was to add more complex filters and configs with tools like log4j and its clones across various languages. This helps, though it often involved complex config files and required restarts, plus really thinking in advance about what to log where, in what subsystems, with what log4j configs, etc. Some of this is fixed now with logstash and external senders, but it’s still messy to me.&lt;/p&gt;

&lt;p&gt;I was always more interested in dynamic per-module logging via flags and log levels, because I want the logs I want, when I want them, and not more.&lt;/p&gt;

&lt;p&gt;For example, usually I’m mostly interested in a single module or issue, so what I really want is to have ERROR debugging on most stuff, but DEBUG on my code.&lt;/p&gt;

&lt;p&gt;This can be done a variety of ways, but we’d often re-use the existing single log level directive with a more complex config line, something like this:&lt;/p&gt;

&lt;p&gt;DEF=1:SECURE=2:NET=2:RULES=5&lt;/p&gt;

&lt;p&gt;This means the default is level 1, security and networking are at level 2, and my rules module is at level 5. Very simple; easy to parse, etc. You can even use sub-subsystems, like “Rules/Evaluator=5”, to be more specific.&lt;/p&gt;

&lt;p&gt;In your code, there’s no need for if-then/macro logic (just like using log4j, etc.); you just send everything to the logger, tagged with its log level and module name (via introspection or other language-specific magic).&lt;/p&gt;

&lt;p&gt;Then in the logger, you check the module and log level tag and emit what the configuration calls for. You can always spice this up with different configs by channel, e.g. what to send to standard out vs. syslog vs. to the DB, etc.&lt;/p&gt;
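&lt;p&gt;A minimal sketch of both halves – parsing a config line like the one above and making the logger’s emit decision (module names and levels are the hypothetical ones from the example):&lt;/p&gt;

```python
# Sketch: per-module log-level config parsing and the emit decision.
# Module names and levels follow the hypothetical example above.
def parse_levels(config):
    """'DEF=1:SECURE=2:NET=2:RULES=5' into a {module: level} dict."""
    levels = {}
    for part in config.split(":"):
        module, _, level = part.partition("=")
        levels[module.upper()] = int(level)
    return levels

def should_log(levels, module, msg_level):
    """Emit when the message level is within the module's configured level."""
    threshold = levels.get(module.upper(), levels.get("DEF", 1))
    return threshold >= msg_level

levels = parse_levels("DEF=1:SECURE=2:NET=2:RULES=5")
should_log(levels, "RULES", 5)   # True: the rules module is wide open
should_log(levels, "NET", 3)     # False: networking is capped at level 2
```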

&lt;p&gt;That’s it. Too bad I almost never see this in any of the hundreds of tools and systems I’ve run over the years, including a long list of current ones I’d love to have this included in.&lt;/p&gt;

</description>
      <category>logging</category>
      <category>observability</category>
      <category>debugging</category>
    </item>
    <item>
      <title>Hi, I'm Steve</title>
      <dc:creator>Steve Mushero</dc:creator>
      <pubDate>Thu, 20 Apr 2017 15:33:27 +0000</pubDate>
      <link>https://dev.to/stevemushero/hi-im-steve</link>
      <guid>https://dev.to/stevemushero/hi-im-steve</guid>
      <description>&lt;p&gt;Greetings everyone, I'm Steve Mushero, Engineer.&lt;/p&gt;

&lt;p&gt;I'm based in Shanghai, China, where I moved to from Silicon Valley about 10 years ago (though heading back to SF/SV this year).  Before that I was in Seattle and New York for many years.  Originally I'm from the great state of Maine.&lt;/p&gt;

&lt;p&gt;I guess I somewhat qualify as a beard-less graybeard these days, though those guys are really old, and I'm still very young, at heart.&lt;/p&gt;

&lt;p&gt;Actually, I JUST missed the punched card era, as my university (Rensselaer, RPI to us) moved to real-time terminals and away from cards a year or two before I arrived in the 1980s - Alas, I never punched a card in anger, though I wish I had worked in cards as I'd surely be a more accurate typist today.&lt;/p&gt;

&lt;p&gt;I also managed to just avoid real work on mainframes or in COBOL or FORTRAN as a real programmer.  Though I did play on PDP-11s, VAX, Prime 750s, IBM 4325s, etc. and built lots of interfaces to COBOL, IMS, and related systems.&lt;/p&gt;

&lt;p&gt;I started programming on Radio Shack Tandy TRS-80 Model 1s in about 1980, some 37 years ago.  BASIC, of course - I've used every version of BASIC up until late-model Visual Basic in the early 21st Century.&lt;/p&gt;

&lt;p&gt;Wrote my first commercial software at about 16, a dating / matching system for high-schools, which was a popular service at the time - did that on a DEC RAINBOW, I think in DOS mode, using the floppy drive as data swap; I think I had 16-32KB of RAM, though I don't recall exactly - not super easy to pack a dozen questions / answers for a thousand students and match them - took 24 hours or so to run, I think (mostly floppy time).&lt;/p&gt;

&lt;p&gt;I also did a lot of work on Industrial PLCs, in Ladder Logic, something I encourage you to learn as a totally different way of thinking, and doing logic.  I loved Ladder and used a very structured approach to quite large-scale complex systems.  Most unusual feature of PLC/Ladder is you update the programs while they (and the machines they control) are running - can be quite dangerous, even lethal.  &lt;/p&gt;

&lt;p&gt;I did hard-core manufacturing &amp;amp; automation engineering along the way, on large-scale machinery, in power, motors, fluids, gas, air, pumps, etc.  That's still my first love &amp;amp; I love the smell and sounds of hydraulics in the morning.&lt;/p&gt;

&lt;p&gt;Then I worked in Client-Server as it became hot, mostly in PowerBuilder and Sybase, including supporting other users in my spare time on CompuServe.  We built very large insurance processing systems in these technologies, for a product still in use today, 25 years later, processing billions in premiums.&lt;/p&gt;

&lt;p&gt;Finally, on to the Internet in 1995 as an early architect of what was called "Push Technology," for which I have a few patents, the simplest of which evolved into RSS via Netscape.&lt;/p&gt;

&lt;p&gt;Been doing Internet stuff, Silicon Valley stuff, and now Cloud Stuff, ever since, usually as CTO or Chief Architect.  Lots of fun stuff for MicroFinance, World Health, Biotech, and lots more.&lt;/p&gt;

&lt;p&gt;Now I build systems to manage the clouds and IT Operations - today we are in PHP, Laravel, React, MySQL, JS, and so on.&lt;/p&gt;

&lt;p&gt;Still so much to learn - decades of tech and I still love building things . . .&lt;/p&gt;

&lt;p&gt;I'm at &lt;a href="http://www.SteveMushero.com"&gt;www.SteveMushero.com&lt;/a&gt; &amp;amp; &lt;a href="http://www.linkedin.com/in/stevemushero/"&gt;www.linkedin.com/in/stevemushero/&lt;/a&gt; &amp;amp; &lt;a class="mentioned-user" href="https://dev.to/stevemushero"&gt;@stevemushero&lt;/a&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Choosing TSDBs - InfluxDB for Us</title>
      <dc:creator>Steve Mushero</dc:creator>
      <pubDate>Thu, 20 Apr 2017 05:03:53 +0000</pubDate>
      <link>https://dev.to/stevemushero/choosing-tsdbs---influxdb-for-us</link>
      <guid>https://dev.to/stevemushero/choosing-tsdbs---influxdb-for-us</guid>
      <description>

&lt;p&gt;&lt;strong&gt;Time-Series Databases&lt;/strong&gt; are powerful and interesting beasts. We are selecting a new one for our OpsStack Total Operations Platform, for use in all parts of the system, from gathering field metrics, to driving our AI &amp;amp; Expert Systems, to handling our internal app logs, function timing, and other data gathering that makes things work (taking after Etsy in this regard).&lt;/p&gt;

&lt;p&gt;We’ve looked at all the various players, including Prometheus, OpenTSDB, Graphite, and others. In the end, we chose InfluxDB and its related tools, for a variety of reasons I wanted to lay out here. &lt;/p&gt;

&lt;p&gt;First, we need to tag on multiple dimensions, which is the new standard and thus makes obsolete the older Graphite-like tags-in-metric-name concepts. We all need a metric name and many, highly-variable tags around that, which are indexed for rapid lookup, like hostname, region name, http request path, log fingerprint, etc. InfluxDB and most new TSDBs support this convenient multi-tag concept.&lt;/p&gt;

&lt;p&gt;InfluxDB also allows multiple data fields, making it easy to gather multi-field data like CPU / RAM use or SQL query types per second. The tag vs. field and indexing / aggregation models are all clear and convincing.&lt;/p&gt;
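&lt;p&gt;For example, a single point in InfluxDB’s line protocol carries a measurement name, multiple indexed tags, and multiple fields, followed by a timestamp (the names and values here are purely illustrative):&lt;/p&gt;

```
cpu,host=web01,region=us-east usage_user=23.5,usage_system=12.1 1508200800000000000
```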

&lt;p&gt;Second, we get data from everywhere, and as InfluxDB is a latecomer, it supports all sorts of data feeds, from its own Telegraf to statsd to collectd to various HTTP endpoints to UDP (perfect and necessary for logging from app code). This lets us integrate with other systems and over time migrate to the most appropriate, plus use the ever-increasing Telegraf ecosystem where we can.&lt;/p&gt;

&lt;p&gt;The InfluxDB mostly-automatic aggregation / reduction system is similar to what others do and very helpful in data crunching, something we are very familiar with from our large-scale monitoring systems.&lt;/p&gt;

&lt;p&gt;Third, using a query language as close to SQL as possible is genius, as it just makes it easier to use while avoiding endless mistakes and challenges from just being different. I detest JSON and various random query languages, or even worse, in-code custom logic functions that resemble a bad ORM. Just say no and use SQL as much as you can, and no one gets hurt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using SQL is a huge plus&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Related to using SQL is that most commands are very similar to MySQL, which we know and love, e.g. ‘show databases’ or ‘use dbname’. This just makes life easier, increasing efficiency while reducing mistakes and confusion.&lt;/p&gt;
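&lt;p&gt;A typical query reads much like SQL – for instance (the measurement and field names are illustrative):&lt;/p&gt;

```sql
-- Mean user CPU per 5-minute bucket over the last hour, per host
SELECT MEAN("usage_user") FROM "cpu"
WHERE time > now() - 1h
GROUP BY time(5m), "host"
```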

&lt;p&gt;Fourth, we looked closely at the increasingly-popular Prometheus, but as a monitoring system, the pull-only agent model is a deal breaker for a SaaS system. Our old systems worked this way, and we just cannot continue to ask customers to open ports for us; the push gateway is not really a solution. In addition, and partly due to the pull model, Prometheus does not allow sending timestamps with the data, which makes it useless for batch gathering and sending, which we need for high-resolution gathering, not to mention in poor-connectivity environments on a global scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No push model is a deal-breaker for Prometheus&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fifth, using Go makes InfluxDB absurdly easy to install and configure, and it has nice packages for every platform, including native MacOS, Windows, various Linux distributions, etc. Easy, easy, and works as advertised. Very different from the systems dependent on Hadoop, or systems that use Go, Python, Java, and Ruby all misguidedly mixed together, for example.&lt;/p&gt;

&lt;p&gt;Finally, the docs are really quite good, as expected for a commercial provider.&lt;/p&gt;

&lt;p&gt;Of course, InfluxDB is pretty new and has had some challenges / changes in their clustering and storage models, so we’ll see how those work out at scale, but for now we are willing to put up with this for what is, to us, a product best-suited to our needs. And we still have to work out whether we’ll really commit to Influx’s Kapacitor for alerting, but we’ll see.&lt;/p&gt;

&lt;p&gt;Learn more about our Total Ops Platform at &lt;a href="https://www.OpsStack.io"&gt;OpsStack.io&lt;/a&gt;&lt;/p&gt;


</description>
      <category>tsdb</category>
      <category>devops</category>
      <category>sysadmin</category>
    </item>
  </channel>
</rss>
