The case for observability

#observability

This article is my attempt at building a generic business case that you can take to different parts of your org to convince them to invest in observability. The cases are at the end if you want to skip my rambling explanations. It's written as a response to this tweet:

Charity Majors

@mipsytipsy

@sugendran How would you make the business case?

04:24 AM - 28 Dec 2019


The reason we need a business case is that anyone who hasn't bought into the idea of observability will see it as a cost, and they need to understand why they should spend time and money on it. They will need to make a trade-off against other activities they could be doing with that time and money. If they don't see the value you see, then you're just going to be fighting them the whole way.

The case will need to speak to a few different stakeholders. You're going to need exec buy-in above all else. You want them backing you up when you argue that the definition of done means your work has observability baked in. You're also going to need product and engineering managers on board as you're going to need to spend some engineering time to get observability set up. Finally, you're going to need to get your peers to understand why and how to use it. The worst thing that can happen is you get an observability solution up and running, and then nobody uses it when it matters.

Side note: My approach to writing these business cases is to first think about the narrative that I want to tell. My aim with the narrative is to get the audience to understand why something is either important, or urgent and important. The way to do that is to understand what is of value to them, and then work out how the narrative shows them that value.

The narrative for how I think about observability is: You would never build a business without a way to understand the behaviour of your clients/users. Building a distributed system is no different. There are complex interactions, and we need to understand how our clients/users move through that system. Without this, we're just guessing about how to optimise, where issues lie, and how to fix production issues.

I am not going to cover any solutions in the cases below. There are different ways to achieve observability and there are plenty of docs covering this. What's important is that your audience understands why we need observability, not how. You should create two options that make sense for your organisation, each with different costs in time and money. This provides the data to help make trade-offs.

For your peers:

Observability is the ultimate way to understand exactly what happens to the user/client as they move about the system. We can trace exactly what they're doing in high fidelity. This means that we can profile real-user behaviour and optimise our code for how they use the product. During an incident, we can narrow our traces to just the affected users and use the detail to understand exactly what they did. Without observability, we are simply guessing based on aggregations and symptoms. Observability will reduce our toil and make our lives easier.
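To make the contrast between aggregates and per-user traces concrete, here is a minimal sketch in Python. The event fields (`user_id`, `endpoint`, `duration_ms`, `status`), the sample data, and the helper names are all assumptions made up for illustration; they are not any particular tool's schema or API.

```python
from collections import defaultdict

# Hypothetical structured events, one per request, in arrival order.
# Field names and values are invented for this example.
EVENTS = [
    {"user_id": "u1", "endpoint": "/cart", "duration_ms": 40, "status": 200},
    {"user_id": "u2", "endpoint": "/cart", "duration_ms": 45, "status": 200},
    {"user_id": "u3", "endpoint": "/checkout", "duration_ms": 2900, "status": 500},
    {"user_id": "u1", "endpoint": "/checkout", "duration_ms": 50, "status": 200},
    {"user_id": "u3", "endpoint": "/cart", "duration_ms": 35, "status": 200},
]


def error_rate(events):
    """Aggregate view: the fraction of requests that failed with a 5xx."""
    return sum(e["status"] >= 500 for e in events) / len(events)


def affected_users(events):
    """Users who saw at least one 5xx response."""
    return {e["user_id"] for e in events if e["status"] >= 500}


def journeys(events, users):
    """Per-user view: every event for the given users, in arrival order."""
    by_user = defaultdict(list)
    for e in events:
        if e["user_id"] in users:
            by_user[e["user_id"]].append(e)
    return dict(by_user)


if __name__ == "__main__":
    print("error rate:", error_rate(EVENTS))
    for user, steps in journeys(EVENTS, affected_users(EVENTS)).items():
        print(user, [(s["endpoint"], s["status"]) for s in steps])
```

The aggregate only says that some share of requests failed. The per-user view says who was affected and what they did around the failure, which is the detail you want during an incident.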

For your engineering and product managers:

Observability is the ultimate way to understand exactly what happens to the user/client as they move about the system. We make product decisions based on data and we should make engineering decisions on data as well. Our system is complex and at times hard to reason about, but that's okay as there are tools to help us. These observability tools give us true insight into how our services perform for each user/client that uses the system. Every time we build a new service, the user's interactions with that service need to be observable. By doing that we can ensure that we are making the most impactful engineering changes and can quickly diagnose issues that occur.

For your executives:

With every feature, we add complexity to our system, making it harder to reason about, harder to optimise, and harder to diagnose issues in. Observability helps us make data-based engineering decisions instead of guesses based on aggregations. It will help us optimise our complex system and diagnose issues that affect users/clients in production. Whether we use a third-party service or run open-source components is up to us, and we can adjust costs to suit our use case. The risk of not adding observability tools is that we will not have the level of data required to properly optimise the system or to quickly diagnose issues in production. Both of these should be seen as opportunity costs that we can and should reduce. The tools exist, and we should use them.

Hopefully, I've made a compelling argument. There is no doubt more that could be said, and I'm hoping that when all three groups talk they'll repeat the same message to each other. You'll likely get a no; what's important is that you move the no to a "not now, but soon". There are a few things you can do to move the conversation forward. Find the leaders who have argued against it and seek to understand why. If they're arguing, it's likely they have thought about it and have concerns they want to talk about. Adjust the case to take their concerns into account. If the task seems too daunting for your org, then start small: get tickets into each sprint to do the steps that Charity Majors suggests in this tweet:

Charity Majors

@mipsytipsy

-- they're keying in on something real, so let's not sweat the rest. What you need to do is bootstrap a solution compelling enough to convert the unbelievers.

You need to:

✅ solve a problem that is currently intractable
✅ swiftly (an hour or two)
✅ cheaply (free? 🤞)

11:12 AM - 20 Dec 2019
