Observability vs. Monitoring: is it about Active vs. Passive or Dev vs. Ops?
Steve Mushero Sep 13 '17
There is a lot of chatter these days about Observability, Events, Metrics, Monitoring, and the like, including a great post by Cindy Sridharan on Monitoring & Observability.
She tackles a range of history and differences between the two, including more formal definitions, but I’d like to look at it another way, partly from the perspective of the system being monitored.
Thinking directionally, Monitoring is the passive collection of Metrics, logs, etc. about a system, while Observability is the active dissemination of information from the system. Looking at it another way, from the external ‘supervisor’ perspective, I monitor you, but you make yourself Observable.
For Monitoring, it doesn’t matter if metrics are pulled or pushed, as that’s a communications strategy. Either way, they are almost entirely passively collected properties or conditions of the system: its resources, state, performance, errors, latency, etc. They are loved and tended to by Ops teams and the most modern Developers. All of the basic SRE/USE signals fit into this category.
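As a rough illustration of passive collection, here is a minimal Python sketch of an outside 'supervisor' sampling a property of the system without the application doing anything. The function names, the sampled path, and the 90% threshold are all hypothetical choices for the example, not part of any particular monitoring tool.

```python
import shutil
import time


def sample_disk(path="."):
    """Read disk usage from the outside; the monitored app is not involved."""
    usage = shutil.disk_usage(path)
    # Guard against odd filesystems that report a zero total size.
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {"ts": time.time(), "path": path, "used_pct": round(used_pct, 1)}


def check(sample, warn_pct=90.0):
    """Classify a sample against a hypothetical alert threshold."""
    return "WARN" if sample["used_pct"] >= warn_pct else "OK"


if __name__ == "__main__":
    reading = sample_disk(".")
    print(reading["used_pct"], check(reading))
```

Whether a loop like this pulls the reading or an agent pushes it to a collector, the system itself stays passive. It is being measured, not speaking up.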
For Observability, the system, its code, and its developers are taking steps to expose information and make the system more observable. This often starts with increasingly rich and structured logs, plus events or markers, JMX data points, and Etsy-style emitted metrics. These are loved and tended to by Developers and the most modern Ops.
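To make the 'active' side concrete, here is a minimal Python sketch of code that makes itself observable by emitting a structured log event and StatsD-style metric lines (the `name:value|type` text format that Etsy's StatsD uses). The `handle_checkout` function, the event name, and the metric names are hypothetical, and actually shipping the lines (for example over UDP to a StatsD daemon) is left out.

```python
import json
import time


def structured_event(name, **fields):
    """Build a structured (JSON) log line instead of free-form text."""
    record = {"ts": time.time(), "event": name}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def statsd_counter(metric, value=1):
    """Format a counter increment in the StatsD line format."""
    return f"{metric}:{value}|c"


def statsd_timer(metric, millis):
    """Format a timing measurement (milliseconds) in the StatsD line format."""
    return f"{metric}:{millis}|ms"


def handle_checkout(order_id, items):
    """Hypothetical business function that reports on itself as it runs."""
    start = time.perf_counter()
    total = sum(price for _, price in items)
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    emitted = [
        structured_event("checkout.completed", order_id=order_id,
                         item_count=len(items), total=total),
        statsd_counter("checkout.completed"),
        statsd_timer("checkout.duration", elapsed_ms),
    ]
    return total, emitted


if __name__ == "__main__":
    _, lines = handle_checkout("o-1", [("book", 12.5), ("pen", 1.5)])
    for line in lines:
        print(line)
```

The point of the sketch is the direction of flow: nothing outside asked for these data points. The code decided what was worth saying and said it.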
Monitoring is most often used for alerting, troubleshooting, capacity planning, and other traditional IT Ops functions, usually not too deeply.
Observability elements, on the other hand, are often much more detailed, more diverse, and used more for debugging, complex troubleshooting, performance analyses, and generally going ‘deeper’.
Perhaps Intent also matters, in that Observing a system could mean enhanced Monitoring of behaviors (often via metrics), e.g. how does it behave under this or that condition, with those inputs, or if I twiddle these knobs? Maybe this is a third category of thing.
Now of course things are not really this simple, as there is overlap, especially as things get higher up the stack. CPU resource utilization is pretty firmly in the Monitoring camp, but what about App Performance, if we even know what that is?
Or, are you emitting enough info, perhaps via monitored metrics, so that Developers or Ops know what’s wrong, and maybe even how to fix it? The boundaries are a bit elusive; they lie along a continuum.
Cindy’s post talks about context, which is also increasingly important. Monitoring and Observability both need it and mutually enhance it, including by providing a much richer set of contextual observations before and during outages, slowdowns, and random problems of the day.
As things get more complex, with more moving parts, and especially more distributed, we need more Observability. We also need more and better monitoring, at higher levels of the stack, and deeper levels of the system, at which point it might look a lot like Observability.
In the end, we need them all, and what it’s called doesn’t matter much. Different teams may use and focus on different terms, directions, and intents, but it’s all in the name of making our systems faster, more reliable, and easier to build, manage, and troubleshoot.