Originally published on Failure is Inevitable.
In our article on SLOs, we discussed the need for service level indicators to be relevant to the users’ experience. By consolidating a number of internal metrics into one indicator that reflects the typical use of the service, we can ensure that meeting our SLO means keeping users happy.
A good way to think about this is by looking at the user’s experience or journey. Not only will this reveal the correct metrics to consolidate into your indicator, but it can provide insights into the pathways and pain points of your service. This goes beyond generating reliability data: an empathic understanding of user experiences can help guide future development projects.
It’s important to note that a single SLI cannot capture the entire user journey alone. A typical user of your service might care about the latency of the site’s response, the availability of key functions, and the freshness of the data they’re accessing. Their happiness with the service during this journey depends on all three, but there’s no way to monitor them as one. For your SLO to be a functional objective, each SLI must be a single metric that can be computed from the service’s monitoring data.
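As a rough sketch of what "a single metric" can look like in practice, each SLI below is computed on its own as the share of good events out of all events. The `Request` record, its field names, and the 500 ms threshold are illustrative assumptions, not the output format of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def availability_sli(requests):
    """Fraction of requests that succeeded, or None if there is no data."""
    if not requests:
        return None
    return sum(1 for r in requests if r.ok) / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests that succeeded within threshold_ms."""
    if not requests:
        return None
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)


sample = [
    Request(120, True),
    Request(480, True),
    Request(90, False),
    Request(950, True),
]
availability = availability_sli(sample)          # 3 of 4 succeeded
fast_enough = latency_sli(sample, threshold_ms=500)  # 2 of 4 were good and fast
```

Keeping the two functions separate mirrors the point above: latency and availability both matter to the user, but each one gets its own indicator and its own objective.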
At the same time, creating SLIs for every possible metric is just as troublesome. As Chris Jones et al. explain in Google’s Site Reliability Engineering book, “You shouldn’t use every metric you can track in your monitoring system as an SLI; an understanding of what your users want from the system will inform the judicious selection of a few indicators. Choosing too many indicators makes it hard to pay the right level of attention to the indicators that matter.” There are nearly endless subsets of metrics you can consolidate into your SLI. Understanding the perspective of users can help you choose.
Generally, SREs will use white box monitoring to precisely observe all the components of the service from the front end to the back. This allows engineers to pinpoint the exact stage at which the service begins to lag or fail. However, this internal perspective can obscure the effects these issues actually have on the users. Black box monitoring can be a useful way to gain insight into your users’ perspective.
With black box monitoring, you act as an external user of the service with no access to the internal monitoring tools. This allows you to concentrate on a few metrics that directly correlate with user happiness. For example, this could mean observing the speed with which the entire service loads, rather than the response time of each and every component.
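A minimal sketch of such an external probe might look like the following. It times a whole user-visible operation from the outside and records only what a user would notice: how long it took and whether it worked. The `fetch` callable is a stand-in (an assumption) for whatever loads the service end to end, such as an HTTP request to the home page.

```python
import time


def probe(fetch, attempts=5):
    """Call fetch() several times; return a list of (seconds, succeeded)."""
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            fetch()
            succeeded = True
        except Exception:
            # From the outside, any error simply counts as a failed load.
            succeeded = False
        results.append((time.perf_counter() - start, succeeded))
    return results


def success_rate(results):
    """Share of probe attempts that succeeded."""
    return sum(1 for _, ok in results if ok) / len(results)


def healthy_page():
    """Placeholder for a real end-to-end page load."""
    return "ok"


def broken_page():
    """Placeholder for a page load that fails."""
    raise RuntimeError("service unavailable")


good_results = probe(healthy_page, attempts=3)
bad_results = probe(broken_page, attempts=3)
```

The probe knows nothing about individual components, which is the point: it only sees the combined result that a user would see.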
By restricting yourself to black box metrics, you know that everything you’re seeing has an impact on users. SLOs should target the threshold of user pain as closely as possible, since an SLO set right at that threshold leaves the largest error budget for development without hurting users. If a disruption never reaches the point where a black box user could see it, it isn’t causing users any pain.
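To make the error budget idea concrete, here is a small sketch of the usual arithmetic: the budget is the share of events the SLO allows to fail, and each bad event spends part of it. The function name and the sample numbers are illustrative assumptions.

```python
import math


def budget_remaining(slo_target, good_events, total_events):
    """Return the fraction of the error budget still unspent.

    slo_target is a ratio such as 0.99. The result is 1.0 when nothing has
    failed, 0.0 when the budget is exactly used up, and negative when the
    SLO has been missed.
    """
    allowed_bad = (1 - slo_target) * total_events
    if allowed_bad == 0:
        raise ValueError("a 100% target or an empty window leaves no budget")
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# A 99% SLO over 1,000 requests allows about 10 failures; 5 have occurred.
remaining = budget_remaining(0.99, good_events=995, total_events=1000)
```

A looser target that still sits below the threshold of user pain yields a larger `allowed_bad`, which is the room developers get to ship changes.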
User journeys aren’t just statistics based on monitoring data – the insights they provide are reliant on the idea that they simulate a specific person using the service in a typical way. But how can you be sure that this simulated person is actually plausible? Kate Kaplan, writing for the Nielsen Norman Group, provides a number of good techniques for building up user personas.
Although Kaplan’s article isn’t focused specifically on the tech industry, its strategies can still provide useful insight for product designers. The technique of direct observation could be particularly useful in tech, where monitoring services can provide metrics for every observed action.
When you take a look through the eyes of your user, you aren’t just finding the right SLIs, but creating key information for constructing a user journey. A user journey is a powerful tool for many aspects of product design as it helps designers focus on users’ priorities. The lessons you learn from developing and analyzing user journeys can be insightful in the most fundamental areas of product design, but for these insights to be accurate, the underlying data must be carefully selected.
An article by Nick Babich on UX Planet provides a good guide on building a user’s journey. SLIs are most relevant to steps four and five: creating a list of touchpoints, and taking user intention into account.
The touchpoints between the user and your service will involve requests and responses – the building blocks of SLIs. For each touchpoint you identify, you should be able to break down the specific SLIs measuring that interaction. From there, you can follow each branch that the user could take, gathering the SLIs for the following requests into a bundle for that journey.
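One way to picture this bundling, as a sketch only: map each touchpoint to the SLIs that measure it, then walk a journey and collect the SLIs it passes through. The touchpoint and SLI names below are hypothetical examples.

```python
# Hypothetical mapping from touchpoints to the SLIs that measure them.
TOUCHPOINT_SLIS = {
    "search": ["search_latency", "search_availability"],
    "view_item": ["item_page_latency", "item_data_freshness"],
    "checkout": ["checkout_latency", "checkout_availability"],
}


def sli_bundle(journey):
    """Collect the SLIs for a journey, in first-seen order, without repeats."""
    bundle = []
    for touchpoint in journey:
        for sli in TOUCHPOINT_SLIS.get(touchpoint, []):
            if sli not in bundle:
                bundle.append(sli)
    return bundle


# A browsing journey that returns to search after viewing an item.
browsing = sli_bundle(["search", "view_item", "search"])
```

Each branch a user could take becomes another list of touchpoints, so comparing bundles shows which SLIs are shared across journeys and which matter to only one.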
Next, to understand user intent, you must identify potential pain points within the service. Your bundle of SLIs can be instrumental in finding pains that might otherwise be invisible.
Let’s say that a user’s journey involves making many requests to the same service component, like clicking through many pages of search results. Individually, these requests return within the SLO set for them, maybe in under a second, and a user looking at just one or two pages will be satisfied with this speed. However, if your user journey involves looking through twenty pages, the annoyance of a nearly one-second wait repeated twenty times could be intolerable. Only with both relevant monitoring data and broader perspectives could you discover this point of user frustration.
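The arithmetic behind this example can be sketched as follows. The 0.9 second page latency, the 1 second per-request SLO, and the 10 second tolerance for the whole journey are assumed numbers chosen only to illustrate the effect.

```python
import math

PER_REQUEST_SLO_S = 1.0    # assumed SLO for a single page of results
JOURNEY_TOLERANCE_S = 10.0  # assumed total wait a user will put up with


def every_request_within_slo(latencies_s):
    """True if each individual request met its own SLO."""
    return all(latency <= PER_REQUEST_SLO_S for latency in latencies_s)


def total_wait(latencies_s):
    """Total time the user spent waiting across the whole journey."""
    return sum(latencies_s)


# Twenty pages of search results, each returning in 0.9 seconds.
pages = [0.9] * 20
per_request_ok = every_request_within_slo(pages)
journey_wait = total_wait(pages)
journey_ok = journey_wait <= JOURNEY_TOLERANCE_S
```

Here every request passes its own check while the journey as a whole does not, since the user waits about 18 seconds in total. That gap is only visible when the SLIs are evaluated along the journey rather than one request at a time.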
Finding these pain points along the user journey could lead to a radical redesign of the service as a whole. Additionally, it opens up a path to solutions deep in the backend and helps determine priorities for development. In our example above, you could either redesign the catalog to avoid the need to look through twenty pages, or you could optimize the components serving those pages until the total delay for twenty pages is acceptable.
All the same tools and processes will help you here: starting with an identified pain point, you can find the relevant SLIs, set more aggressive SLOs for them, and use white box monitoring to diagnose where improvements have to be made. These steps can even lead you to re-evaluate your SLIs, choosing more carefully with respect to users’ wants. New SLIs can help find new pain points in future user journey exercises, continuing the cycle of improvement.
Carefully choosing SLIs is instrumental to having SLOs that keep both users and developers happy. Considering a user’s perspective not only helps you choose SLIs, but can provide transformative insights into product design.
If you’d like to dive deeper into practices from industry leaders on choosing good SLIs and setting effective SLOs, check out our ebook Encountering SRE!