We have a contest going with New Relic right now, centered largely around observability, and it got me wanting to discuss observability in general.
(Embedded post: "Announcing the New Relic Hack the Planet Contest!" by Jess Lee for The DEV Team, Jan 28 '21)
How do you develop a mental model around observability in software development? How do you teach new folks about observability?
Top comments (13)
Great question! Here's my mental model of observability.
Observability includes, but isn't limited to, logging errors, counting metrics, and visualizing data. It encapsulates anything that lets you see how your app is performing against engineering goals. Engineering goals could include uptime, error rate, read/write times, API call times, etc.
What's so awesome about observability is the impact it has on an engineering org. You can contribute a small but well-thought-out metric, and it could help uncover some crazy bug that has been plaguing devs for years! But before I get too excited, I want to make the distinction between observability for a software dev versus a DevOps engineer, because I think that's an important differentiation that helps when teaching someone new to the space.
Both of these roles overlap, but they have some key differences. DevOps engineers are primarily concerned with infrastructure: will my database be able to handle x number of requests? Software devs are concerned with the application: how long does an API request take for this route?
Observability gives you the pieces of information needed to answer these questions. And by combining multiple metrics, logs, or errors, you get a more complete understanding of an issue. It can also help you catch issues before your users find them!
I've been rambling - so I'll end on one last thought. Starting with nothing when it comes to observability is rough. That's why tools like New Relic are really great solutions! They give you a lot of metrics out of the box.
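To make the "small but well-thought-out metric" idea concrete, here's a minimal sketch in TypeScript. It's not New Relic's API or any vendor's; the names (recordMetric, timedFetch, the metric names) are made up, and an agent like New Relic's would give you this kind of timing and error data out of the box.

```typescript
// Illustrative only: record a timing metric and an error count around an
// API call, so a dashboard or alert can later be built on top of them.
type MetricEvent = {
  name: string;                  // e.g. "api.request.duration_ms"
  value: number;
  tags: Record<string, string>;
  timestamp: number;
};

const metrics: MetricEvent[] = []; // stand-in for a real metrics backend

function recordMetric(name: string, value: number, tags: Record<string, string> = {}): void {
  metrics.push({ name, value, tags, timestamp: Date.now() });
}

async function timedFetch(route: string): Promise<Response> {
  const start = Date.now();
  try {
    const res = await fetch(route);
    // One metric answers "how long does an API request take for this route?"
    recordMetric("api.request.duration_ms", Date.now() - start, { route, status: String(res.status) });
    return res;
  } catch (err) {
    // Another answers "how often does this route fail?"
    recordMetric("api.request.error", 1, { route });
    throw err;
  }
}
```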
Two-dimensional arrays. But not 'spatial x spatial', rather 'spatial x temporal'. You end up with essentially a timeline: an 'observer' creates a timeline (usually by calling next(), complete(), or error() as events are observed), and the timeline is then an 'observable.' In JS, RxJS provides an excellent library for this, as well as a swath of 'operators' (creating a new 'timeline' from a given 'timeline') that makes for really great dev exp, both in terms of writing declarative code and in terms of dealing with a frontend environment that's driven by observed user and/or server events!
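For anyone who wants to see that timeline idea in code, here's a tiny RxJS sketch (the values and variable names are just illustrative): an observer pushes events onto the timeline with next() and complete(), and operators derive a new timeline from the original.

```typescript
import { Observable } from "rxjs";
import { filter, map } from "rxjs/operators";

// An "observable" is a timeline of values. The function passed to the
// constructor plays the observer's role: it emits events over time via
// next(), and finishes the timeline with complete() or error().
const clicks$ = new Observable<number>((subscriber) => {
  subscriber.next(1);
  subscriber.next(2);
  subscriber.next(3);
  subscriber.complete();
});

// Operators create a new timeline from an existing one.
const evenDoubled$ = clicks$.pipe(
  filter((n) => n % 2 === 0),
  map((n) => n * 2)
);

evenDoubled$.subscribe({
  next: (value) => console.log("observed:", value), // logs 4
  complete: () => console.log("timeline complete"),
});
```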
But by doing this you need to save 'em in a database or something similar, so as to visualize it later on in real time, right?
I struggle with defining and implementing SLIs and SLOs. This seems like it would be simple. I'm building a Kubernetes-based platform, which makes observability much easier than the bunch of old VMs I'm dealing with.
It is about creating a feedback loop for the change you introduced or plan to introduce, by defining and measuring the outcome you expect.
From a system point of view, there are some common expectations that someone can enumerate depending on the type and scale of the system.
Some additions to my mental model above:
Observability is the necessary instrumentation to enable monitoring of a change or a system whereas traceability is the necessary instrumentation to enable tracing across systems.
The bigger the scale, the more metrics we care about and the more we need from a monitoring stack and observability capability. If we put DevOps, application developers, product managers, and business owners on a spectrum, they each need different metrics to make their own decisions, but the variety of metrics gets smaller toward the right. The types of data and the types of aggregations are limited, but there are many ways to build dashboards and/or alerts depending on the volume and velocity of data. This means there is no one-size-fits-all for everything. Don't reinvent the wheel, as you don't want to panic while panicking :)
Some types of observability metrics enable or even automate the rollout of changes.
Operational aspects like IaC, fluency in navigating a large number of metrics, in creating new ones, and in executing runbooks play an important role, as most such metrics might be designed to be ignored 99% of the time until the right moment. Information that cannot eventually lead to any decision is useless.
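Tying this back to the SLI/SLO question above, here's a minimal, purely illustrative sketch of that feedback loop: an availability SLI computed as good events over total events, compared against an SLO target, with the remaining error budget as the signal that could gate (or even automate) a rollout. The numbers and names are invented, not taken from any SLO tooling.

```typescript
// Illustrative only: an availability SLI as good events / total events,
// checked against an SLO target over a rolling window.
interface WindowCounts {
  good: number;   // e.g. requests with status < 500 and latency < 300ms
  total: number;
}

const SLO_TARGET = 0.999; // 99.9% availability over the window

function availabilitySli({ good, total }: WindowCounts): number {
  return total === 0 ? 1 : good / total;
}

// Error budget: how much unreliability the SLO still allows in this window.
function remainingErrorBudget(counts: WindowCounts): number {
  const allowedBad = (1 - SLO_TARGET) * counts.total;
  const actualBad = counts.total - counts.good;
  return allowedBad - actualBad; // negative means the budget is blown
}

// The feedback loop: a rollout (or the next one) can be gated on the budget.
const window: WindowCounts = { good: 999_120, total: 1_000_000 };
console.log("SLI:", availabilitySli(window).toFixed(4));   // 0.9991
console.log("budget left:", remainingErrorBudget(window)); // 1000 allowed - 880 actual = 120
```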
To me a system has observability when I can ask questions of collected telemetry that I didn't know I was going to need to ask beforehand. The telemetry may take many forms, such as application logs, DB query logs, traces, metrics, probes, or even user analytics!
In an ideal world, all of the telemetry you collect would have as much context as possible, and common threads such that, given an error log, I can step back through traces, graph related metrics, partition by user type or geographic location, etc. This is possible for logs, traces, and analytics, but at present most metrics stores will choke on high-cardinality dimensions.
It's an important capability that a lot of the current tools are lacking... to be able to drill into telemetry with questions like: how many users were affected, does it affect a specific type of user or all users, does it affect all services or just a couple, what were response times like around these specific events, etc.
When I speak to folks about observability I tend to frame it as thinking about what information they'd throw into a debug statement to troubleshoot their service. Often (not always, but often) this is good information for observability. This information can then be wrapped into whatever telemetry tools you have available.
Of course, collecting all of this telemetry is worthless if you don't have the tools to explore it and answer the questions you have... which brings me back around to my starting paragraph. Observability is the ability to ask questions of collected telemetry that I didn't know I was going to need to ask. Just collecting metrics, logs and traces is not enough for a system to have observability.
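As a small illustration of "asking questions you didn't plan for", here's a sketch that emits wide, high-context events and then filters them by dimensions that were never baked into a dashboard. The event shape and field names are made up; a real telemetry store would replace the in-memory array.

```typescript
// Illustrative "wide event": one record per request carrying as much
// context as is practical, so questions can be asked after the fact.
interface RequestEvent {
  traceId: string;
  service: string;
  route: string;
  status: number;
  durationMs: number;
  userType: "free" | "paid";
  region: string;
}

// Stand-in for a queryable telemetry store.
const events: RequestEvent[] = [
  { traceId: "a1", service: "checkout", route: "/pay", status: 500, durationMs: 820, userType: "paid", region: "eu-west-1" },
  { traceId: "b2", service: "checkout", route: "/pay", status: 200, durationMs: 140, userType: "free", region: "us-east-1" },
  { traceId: "c3", service: "catalog", route: "/search", status: 200, durationMs: 95, userType: "paid", region: "eu-west-1" },
];

// A question nobody planned for: "how many paid users in eu-west-1 hit
// errors, and what did their latency look like?"
const affected = events.filter(
  (e) => e.status >= 500 && e.userType === "paid" && e.region === "eu-west-1"
);
console.log("affected paid users in eu-west-1:", affected.length);
console.log("their latencies (ms):", affected.map((e) => e.durationMs));
```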
Such thorough insights in the comments thread as always 😄
Personally, I feel observability is more of an evolving science than a perfect one as metrics that make sense today might not do so 6 months later. Also, it's better to have 1 working dashboard than 4-5 partial ones.
Lastly, the mindset should be to consider the platform's health/uptime as a top priority & use the right tools to ensure that.
I hadn't thought before about there being a mental model, or about developing one. I guess our model is around the health of a system and how to check on that health, like a doctor or car driver who wants to know about a patient or car, get early signs of things going awry, and have enough detailed evidence to figure out the cause - the why.
Some of what I'll mention below will be obvious but I hope it explains how we match a need for transparency of errors and reporting with the tools we have.
We did have a team book club around Observability.
In my team and for the org, we have SLAs around uptime, so we want our software to be up and performant. We want New Relic, and now Datadog instead, to ping us when something is down or irregular so we can investigate.
A vague error is useless, so we add layers of monitoring and logging. The server-side logs, such as those from pods or the DB or Apache, go somewhere we can inspect by hand (annoying) or, ideally, view in CloudWatch or Datadog. Then we can search and sort and build dashboards.
Each of the monitoring tools tells us something about the health of the system.
APM and Kubernetes metrics tell us about server-side performance and error rates, and whether servers are restarting.
Synthetic monitoring tells us that a machine can reach our site or endpoint in a timely manner and get a success message. This is typically what we use to page ourselves.
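Stripped of the vendor tooling, a synthetic check boils down to something like this sketch (the URL, timeout, and "paging" mechanism are placeholders; real synthetics run from multiple locations and feed an alerting system):

```typescript
// Placeholder endpoint and thresholds.
const ENDPOINT = "https://example.com/health";
const TIMEOUT_MS = 5_000;

async function syntheticCheck(): Promise<void> {
  const start = Date.now();
  try {
    const res = await fetch(ENDPOINT, { signal: AbortSignal.timeout(TIMEOUT_MS) });
    const elapsed = Date.now() - start;
    if (!res.ok) {
      console.error(`PAGE: ${ENDPOINT} unhealthy (status ${res.status}, ${elapsed}ms)`);
    } else {
      console.log(`OK: ${ENDPOINT} responded ${res.status} in ${elapsed}ms`);
    }
  } catch (err) {
    console.error(`PAGE: ${ENDPOINT} unreachable or timed out`, err);
  }
}

// e.g. run every 5 minutes, mirroring the "test that runs every 5 min" above.
setInterval(syntheticCheck, 5 * 60 * 1000);
```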
RUM and JS errors are one thing in New Relic but two in Datadog, but we use those to tell us about the experience for all users on the site according to their actual browsers, not according to our test that runs every 5 min. We can see pages that are slow or most visited, assets loaded, and whether any page has an unusual amount of JS console errors.
We also have a need to know if a service is appearing stable in one way but irregular in another.
Like we are setting up anomaly detection to tell us when an error rate exceeds an acceptable threshold.
Or when the number of items on a dead-letter queue, or a queue in general, exceeds a certain amount for a period, because our system is too slow or not able to recover well when it hits errors or low volume.
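A much-simplified sketch of that kind of threshold alerting (real anomaly detection in New Relic or Datadog is statistical; this is just a static threshold that has to be breached for several consecutive windows, with made-up numbers):

```typescript
// Made-up thresholds: alert only when the error rate stays above the
// threshold for several consecutive checks, to avoid paging on blips.
const ERROR_RATE_THRESHOLD = 0.05; // 5%
const SUSTAINED_CHECKS = 3;        // e.g. 3 consecutive 1-minute windows

let breaches = 0;

function evaluateWindow(errorCount: number, requestCount: number): void {
  const errorRate = requestCount === 0 ? 0 : errorCount / requestCount;
  breaches = errorRate > ERROR_RATE_THRESHOLD ? breaches + 1 : 0;

  if (breaches >= SUSTAINED_CHECKS) {
    console.error(`ALERT: error rate ${(errorRate * 100).toFixed(1)}% above threshold for ${breaches} windows`);
  }
}

// The same shape works for queue depth: swap error rate for
// "items on the dead-letter queue" and alert when it stays high.
evaluateWindow(12, 180); // 6.7% -> first breach
evaluateWindow(15, 200); // 7.5% -> second breach
evaluateWindow(20, 250); // 8.0% -> third breach, alert fires
```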
Oh, that's the other thing. We correlate metrics. So we play detective and figure out - did the increase in volume correspond to an increase in error rate? Did one cause the other, or is it just coincidence?
Why is it that one synthetic test location or one of the servers is consistently slower than the others? Or one server gets a lower volume of requests than the others, even though they are all weighted equally?
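That detective work can start as simply as lining up the two series over the same time buckets and computing a correlation coefficient, as in this sketch with invented numbers (and the usual caveat that correlation is not causation):

```typescript
// Invented per-minute samples over the same time buckets.
const requestVolume = [100, 180, 240, 400, 390, 150];
const errorRate     = [0.01, 0.02, 0.03, 0.07, 0.06, 0.01];

// Pearson correlation coefficient between two equal-length series.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// A value near 1 suggests the spikes move together; it still can't tell
// you which one caused the other (or whether something else caused both).
console.log("volume vs error rate:", pearson(requestVolume, errorRate).toFixed(2));
```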
Great question. I guess you need to know (find out) first what you need to know :-) Otherwise you end up collecting gazillions of "data" which end up not being useful ...
I must admit, I have no idea what observability means in this context. I kinda just read it as "monitoring", but I might just have the wrong idea there.
Yeah I was also taking it as "monitoring" but that might be too limited a view? or not :-)