SRE + Honeycomb: Observability for Service Reliability

#sre #reliability #observability #devops

As a Customer Advocate, I talk to a lot of prospective Honeycomb users who want to understand how observability fits into their existing Site Reliability Engineering (SRE) practice. While I have enough of a familiarity with the discipline to get myself into trouble, I wanted to learn more about what SREs do in their day-to-day work so that I’d be better able to help them determine if Honeycomb is a good fit for their needs.

After doing some reading on my own, I asked my fellow Bees for their thoughts on various definitions of SRE floating around in the wild. Luckily for me, a few teammates who’ve worked as SREs chimed in: Principal Developer Advocate Liz Fong-Jones, Developer Advocate Shelby Spees, and Lead Integrations Engineer Paul Osman. Liz recommended a blog post from her time at Google featuring a video where she and Seth Vargo explain how SRE and DevOps relate. I definitely recommend giving it a watch.

After that, Shelby offered to discuss the topic further 1-on-1 and I was happy to take her up on it. For more color, below is the conversation between Shelby & me (two bees in a pod) 🐝

Jenni and Shelby discuss SRE and Honeycomb

Shelby: The Google SRE book establishes how they build reliable systems at Google, but the actual work of SREs has been around for decades. It’s just that this specific term "site reliability engineering" was coined and popularized by Google. SRE is really about helping accomplish business goals by making software systems work better, because "reliability" is defined by the end user's experience.

One of the biggest takeaways for me from the Google book that I hadn’t seen elsewhere was the introduction of SLOs (service level objectives) and error budgets. SREs are the ones who determine: How many engineering brain cycles do we want to sink into this new code? Or do we spend our time making our customers happier by eliminating tech debt? That decision-making process is kind of the motivation behind SLOs. ⚖️

A rough example of an SLO for a Netflix-like streaming company might be: Can your customers watch their TV show with <10 sec of buffering? For the viewer, it doesn’t matter what part of your system is causing the buffering to happen, they just want to watch the show. So with SLOs, it was a novel idea just to be able to measure whether or not people are having a bad experience based on what your systems are reporting.

Building on that, we can start to measure “good enough” vs. “too broken.” Issues and incidents are rarely all-or-nothing nowadays. If our service is degraded, how soon will our customers notice and complain? Refreshing once in a multi-hour streaming session is reasonable, but twice per episode probably isn’t. So SREs do the work of figuring out how to convert the measurements you have coming out of your system into something that gives you insight into your user experience in real-time.

Perhaps an SLO where 99.9% of streaming sessions have <10s of buffering is fine: people can hit refresh if there are buffering errors, and it’s not catastrophic. SREs are the ones who measure how much risk we’re willing to take to build a new feature, or how degraded we can allow our service to get. While it’s not usually as simple as “new features vs. reliability,” that’s a good way to understand it when you’re first learning about this stuff.
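
To make the arithmetic concrete, here is a minimal Python sketch of checking that example SLO. The `Session` type, the `slo_compliance` helper, and the sample data are all hypothetical, invented for illustration; only the 99.9% target and the 10-second threshold come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str            # hypothetical identifier for one streaming session
    buffering_seconds: float   # total time the viewer spent waiting on buffering


def slo_compliance(sessions, max_buffering_seconds=10.0):
    """Fraction of sessions whose total buffering stayed under the threshold."""
    if not sessions:
        return 1.0  # no traffic, so nothing has violated the objective
    good = sum(1 for s in sessions if s.buffering_seconds < max_buffering_seconds)
    return good / len(sessions)


# Made-up sample data: three good sessions and one bad one.
sessions = [
    Session("a", 0.0),
    Session("b", 3.5),
    Session("c", 12.0),  # more than 10s of buffering, so this one counts against us
    Session("d", 9.9),
]

TARGET = 0.999  # "99.9% of streaming sessions have <10s of buffering"
compliance = slo_compliance(sessions)
print(f"compliance={compliance:.3f}, meeting SLO: {compliance >= TARGET}")
```

With real traffic the same ratio would be computed over a rolling window (say, the last 30 days) rather than a hard-coded list.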

At the end of the day, all of the work we do is a living experiment. SREs have a finger on the pulse to see if this is good enough to make a decision on the trade-off between building new code or improving system reliability. #everythingisanexperiment #honeycombcorevalues

Jenni: Yea! And this really reminds me of the time when I was interviewing Erol Blakely, Director of SRE at ecobee, for their customer case study.

Erol said two things that stuck out to me about their SRE practice: 1) that SLOs are his three favorite letters 😂 and 2) that before Honeycomb, they had paralysis by analysis whenever they would try to define their SLOs, because their methodology and tooling for creating SLIs (service level indicators) and then creating SLOs would take weeks of white-boarding and head-banging 🤘 to try to get the right measurements in place and accurate. He said it also took a lot of time to even see if their engineering objectives were actually something the customer or the business cared about.

All of this was 10x harder for ecobee to solve because their tools and dashboards lacked the functionality, making it impossible to iterate on their SLIs and SLOs without more head-banging and yadayadayada. So when Erol saw Honeycomb’s rapidly iterative SLO feature, its baked-in anomaly detector, AND built-in error budgets, he said it was love at first sight 😍 and that he “would’ve needed to hire a full-time Grafana and Prometheus admin to do the work that Honeycomb could do for him.”

Shelby: Totally right! SLOs help teams have better conversations about their engineering effort, but it’s not always easy to define them. And they really just give the whole company a high-level view of the system and the purpose of the system. SLOs offer support to the teams who are building these systems on how to make them more resilient and understand the impact of changes.

Jenni: Teams keep telling me “Jenni, we’re practicing SRE” and I keep seeing “SRE” titles, so then why do some people still consider it a niche role?

Shelby: Here’s a good explanation from Paul Osman (Lead Integrations Engineer) on SREs:

"The SRE role varies wildly from company to company. Some organizations have adopted the Google model fairly consistently, in other orgs it's closer to what Facebook call 'Production Engineering'. In some orgs, it's re-branded 'ops engineering', in some orgs it's closer to what some people call 'infrastructure engineering'. And of course, it's everywhere in between. In my mind, what defines an SRE vs those other roles is that an SRE team should be working, through systems and processes, to help determine how much time and effort should be spent on reliability and helping the larger eng org move that work forward."

The number of people doing SRE according to how Google wrote about it is a small minority of actual SRE teams. Part of the issue with making decisions based on “What’s the impact to our customer, services, business?” is that you have to actually talk to business stakeholders and convince them to balance feature development with paying down technical debt.

Regardless of whether a team is practicing SRE “by the book” or not, these are difficult conversations to have. It’s not easy to get buy-in, which is why I appreciate how well SLOs map to business KPIs and product user stories. The ability to translate between business priorities and engineering effort is a core skill for SREs.

In reality, companies hire SREs and then silo them into ops or devops-y roles without the authority to influence prioritization decisions, so teams have to continue fighting fires indefinitely. You see the worst version of this in SRE job descriptions where they say "engineer CI/CD pipelines" but they mean "manage our Jenkins configurations because we don’t want the developers ever thinking about builds."

Other times, a person with lots of experience with ops work gets hired into an SRE role with the opportunity to make a real impact. Since they’re an ops person, they’re drawn to interesting infrastructure and platform solutions like Kubernetes, which may indeed solve process problems and help the systems work more reliably at scale. But these solutions also add complexity, so the SRE who set it all up ends up becoming a Kubernetes engineer full-time--now their expertise is needed to keep the system afloat.

SRE is such a cross-cutting function, you really need to find people with both the expertise as a software practitioner and the big-picture business mindset.

Jenni: Obviously a lot easier said than done. Conversations are difficult enough in general, imagine trying to navigate all of those competing interests (and personalities) and how to balance your growth as a company versus making sure your service is reliable. It’s a crucial tradeoff for the longevity of the business.

Shelby: Oh yea. But SLOs are a very useful tool for this, especially when they build on the instrumentation that people are already using. Which is why investing in observability can pay dividends in your reliability work.

When you have observability into the business logic of your system (not just system level metrics) AND the high cardinality data on how all your endpoints are behaving or even what individual customers are experiencing, that allows you to go and debug critical production issues quickly because you’re already collecting rich data on it.

Then once you’ve interacted with the data frequently and find yourself asking similar questions over and over, you can then define an SLI based on that query or question you’re asking. Since you’ve already instrumented your code to send all that context all the time, you don’t have to do any extra work to add SLI measurements. Then you can use the SLO tool in Honeycomb to show you the compliance with your measurement over time! 😉
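
As a rough illustration of turning a repeated question into an SLI, the Python sketch below expresses "did this playback request succeed quickly?" as a function over structured events. The field names (`endpoint`, `status_code`, `duration_ms`, `customer_id`), the thresholds, and the three-way true/false/not-applicable result are assumptions chosen for this example, not a description of how any particular tool stores or evaluates SLIs.

```python
def streaming_sli(event):
    """Return True (good), False (bad), or None (event is not relevant to this SLI)."""
    if event["endpoint"] != "/play":
        return None  # only playback requests count toward this SLI
    # Good means: no server error, and answered in under one second.
    return event["status_code"] < 500 and event["duration_ms"] < 1000


# Hypothetical events, already carrying the context the code was instrumented to send.
events = [
    {"endpoint": "/play", "status_code": 200, "duration_ms": 120, "customer_id": "c1"},
    {"endpoint": "/play", "status_code": 503, "duration_ms": 40, "customer_id": "c2"},
    {"endpoint": "/health", "status_code": 200, "duration_ms": 2, "customer_id": None},
    {"endpoint": "/play", "status_code": 200, "duration_ms": 2400, "customer_id": "c3"},
]

results = [streaming_sli(e) for e in events]
qualified = [r for r in results if r is not None]
print(results)
print(sum(qualified), "good out of", len(qualified), "qualifying events")
```

Because the SLI only reads fields that are already on every event, adding it requires no new instrumentation, which is the point made above.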

Plus (my favorite part): Honeycomb’s SLO feature automatically performs BubbleUp over the events considered by your SLI, showing you which fields stand out in the events that fail your SLO. If you get an alert, you can look at the dimensions in SLO BubbleUp and that gives you a really good starting point for your investigation. So now you’ve tied together business impact (via your SLO) with the developer tooling you already use to observe your systems in production.
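
The idea of "which fields stand out in the failing events" can be sketched in a few lines of Python. To be clear, this is not Honeycomb's BubbleUp implementation; it is a simplified stand-in that compares how often each field value appears among failing versus passing events. The function name, field names, and sample events are all hypothetical.

```python
from collections import Counter


def standout_values(events, sli, fields):
    """Rank (field, value) pairs by how over-represented they are in failing events."""
    bad = [e for e in events if sli(e) is False]
    good = [e for e in events if sli(e) is True]
    findings = []
    for field in fields:
        bad_counts = Counter(e.get(field) for e in bad)
        good_counts = Counter(e.get(field) for e in good)
        for value, count in bad_counts.items():
            bad_share = count / len(bad)
            good_share = good_counts[value] / len(good) if good else 0.0
            findings.append((bad_share - good_share, field, value))
    # Largest difference first: these are the dimensions worth looking at.
    return sorted(findings, key=lambda f: f[0], reverse=True)


def request_succeeded(event):
    return event["status_code"] < 500


# Made-up events: every failure happens to come from one availability zone.
events = (
    [{"availability_zone": "us-east-1a", "endpoint": "/play", "status_code": 200}] * 4
    + [{"availability_zone": "us-east-1b", "endpoint": "/play", "status_code": 200}] * 1
    + [{"availability_zone": "us-east-1b", "endpoint": "/play", "status_code": 503}] * 3
)

ranking = standout_values(events, request_succeeded, ["availability_zone", "endpoint"])
for difference, field, value in ranking:
    print(f"{field}={value}: {difference:+.2f}")
```

In this toy data the availability zone ranks first because it covers all of the failures but only a fifth of the successes, while the endpoint is identical across both groups and so tells you nothing.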

You should check out this incident report from July. Martin was on-call, and this was the first middle-of-the-night page we’d had in a while. It was an SLO burn alert, so Martin opened up the SLO that was alerting. BubbleUp showed him that the bad events were in a specific availability zone, so he said “OK, I’ll just remove that availability zone from the autoscaling group.” He got a read on the situation in less than five minutes and started working on a fix. 💪
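
For readers new to burn alerts, the general idea is to page when the error budget is being spent fast enough to run out soon. The Python sketch below uses a naive linear projection; the function names, the four-hour threshold, and the numbers are assumptions for illustration, and a real alerting system may well use a different calculation.

```python
def hours_until_exhausted(budget_remaining, failures_last_hour):
    """Project how long the remaining error budget lasts at the recent failure rate."""
    if failures_last_hour <= 0:
        return float("inf")  # nothing is burning budget right now
    return budget_remaining / failures_last_hour


def should_page(budget_remaining, failures_last_hour, threshold_hours=4.0):
    """Page a human only if the budget would be gone within the threshold."""
    return hours_until_exhausted(budget_remaining, failures_last_hour) <= threshold_hours


# Hypothetical numbers: 600 failed events of budget left.
quiet_night = should_page(budget_remaining=600, failures_last_hour=10)
bad_night = should_page(budget_remaining=600, failures_last_hour=200)
print("quiet night pages:", quiet_night)
print("bad night pages:", bad_night)
```

Alerting on projected exhaustion rather than on every error is what keeps middle-of-the-night pages rare: slow burns can wait until morning.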

And so that’s why Liz encouraged us to adopt SLOs as a feature: observability makes investigating issues a lot easier. This means that you're not just measuring what's important to the business, you're also empowering your team to actually have a positive impact on service reliability, both during incidents and in day-to-day development work.

Jenni: So, can you get the same SLO result without instrumenting your code?

Shelby: It depends on the questions you’re asking. The observability approach to SLOs kinda requires structured data. Most people don’t have the quality of data to have that level of granularity or insight into the user experience. Instead, people will write their SLO like, “we should have 99.9% uptime,” although I think the language is usually more declarative than that. But that’s great! The important thing is to start measuring and learning. Many teams don’t have the bandwidth to sit down and ask, “how many nines of uptime did we have last year?” and they'll likely get a different answer from each monitoring tool they check.

So I would tell people: start with something, and if it turns out you’re up more than what you defined for your SLO then maybe you have the error budget for taking more risks! You could do a chaos experiment to learn more about your system or test out some performance enhancements you’ve been wanting to try. But, if you don’t meet your SLO and find out that it was way too ambitious as a goal, you still learned something. You can adjust the SLO and reset your error budget. Maybe your standards are too high, or you may need to do some reliability work to meet your goal.
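
Here is a small worked example of the error budget math, as a Python sketch. The event counts are made up, and `error_budget` is a hypothetical helper; the only thing taken from the conversation is the idea that a 99.9% target leaves 0.1% of events as budget.

```python
def error_budget(total_events, failed_events, target=0.999):
    """Return (allowed_failures, remaining_budget) for an event-based SLO."""
    allowed_failures = total_events * (1 - target)
    remaining = allowed_failures - failed_events
    return allowed_failures, remaining


# Hypothetical month: one million qualifying events, 400 of them failed.
allowed, remaining = error_budget(total_events=1_000_000, failed_events=400)
print(f"allowed failures: {allowed:.0f}")
print(f"budget remaining: {remaining:.0f} ({remaining / allowed:.0%} of budget)")

if remaining > 0:
    print("Budget left over: room for a chaos experiment or a risky deploy.")
else:
    print("Budget spent: time for reliability work, or a less ambitious SLO.")
```

A negative `remaining` is the "way too ambitious" case described above: the signal to either adjust the target or prioritize reliability work.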

Also, it’s important to remember that SLOs are an internal tool, to help teams to have better conversations about service reliability. If your hypothesis was incorrect or your goal was too ambitious, that’s not a bad thing as long as you're learning. It’s only a problem if you fail to adjust your approach.

Jenni: It almost sounds like the learning is more important than meeting the SLO itself.

Shelby: I mean, it kind of is. Some teams have so little insight into how their systems run in production that paying attention to any of this stuff will make a huge difference. You don’t need to be at Google scale to benefit.

If you don’t have good observability, then start with something simple like Pingdom checks: Is it even up?? Look at the history if you have it, set a goal, and then start mapping your reliability efforts to your SLO results. That mapping part is hard, though, which is where observability comes into play. You can gradually add richer instrumentation either through agents on your system or preferably through instrumenting your code. That allows you to start asking more sophisticated questions and creating SLIs that map closer to user experience.
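
Even the "is it even up?" starting point yields a usable number. The Python sketch below turns a history of up/down check results into an availability figure and a count of nines. The data is invented and the helpers are hypothetical; the format of a real uptime checker's history would need its own parsing.

```python
import math


def availability(check_results):
    """check_results: list of booleans, True when the check found the site up."""
    if not check_results:
        raise ValueError("need at least one check result")
    return sum(check_results) / len(check_results)


def nines(avail):
    """Express availability as a number of nines, e.g. 0.999 is about 3 nines."""
    if avail >= 1.0:
        return float("inf")
    return -math.log10(1 - avail)


# Made-up history: 10,000 checks, 10 of which found the site down.
history = [True] * 9_990 + [False] * 10
measured = availability(history)
print(f"availability: {measured:.4%}, roughly {nines(measured):.1f} nines")
```

A baseline like this is enough to set a first goal; richer instrumentation can replace the up/down signal later without changing the arithmetic.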

Jenni: So SLOs aren’t like a contract, they’re an internal tool. That makes it a lot less intimidating to start experimenting with different kinds of instrumentation and SLIs.

Shelby: The goal is to learn how your systems work in production, and instrumenting for observability lets you have more informed SLIs. Also, you’re not gonna get it right the first time, it’s an iterative process. And it doesn’t have to be a lot of work up-front! Leaning on your observability tooling and instrumentation means you don’t have to do a bunch of whiteboard work to engineer your own SLO system and host it all yourself.

Jenni: So you mentioned chaos engineering. Tbh I’m not totally familiar with how it relates.

Shelby: Oh yea! Chaos engineering is a growing field for SREs that builds on the important work of testers and QA. What makes a great tester is that they break a program in novel ways that the developer never expected, which helps the developer write better code or build a better UI.

It’s similar with chaos engineering, except now we’re talking about your services and infrastructure. Chaos experiments involve purposely breaking things in prod 💥 to see how bad it can get without your customers noticing. And similar to traditional testing, it’s not just about breaking things, it’s about learning. With chaos engineering, the goal is to learn the boundaries of your services. And (as you might expect) your learning is facilitated by having really good observability!

Jenni: I see! SRE involves a lot of experimentation. 🤔

Shelby: I mean, one common thread among the SREs I know is that they really care about their work. And we’re starting to see that many SREs are interested in Honeycomb. So Jenni, when you’re talking to them, it’s all about helping them see how observability supports their existing reliability work.

At the end of the day (and this blog post) SREs are in a position to make things better for developers and for the business. That feedback loop in production, a shared sense of building vs. fixing, and the work of automation goes a long way to helping build more resilient and reliable systems.

Jenni: And the beat goes on. Hey Shelby, we should do this more often.

Shelby: Yea this was fun. I’m gonna call you the next time I feel a rant coming on 😂

Jenni: Well, you got my number. 867-5309.