<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Danyel Fisher</title>
    <description>The latest articles on DEV Community by Danyel Fisher (@danyelf).</description>
    <link>https://dev.to/danyelf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F434528%2Febaf5f43-6e8d-4dd7-941a-9323a74fbd18.png</url>
      <title>DEV Community: Danyel Fisher</title>
      <link>https://dev.to/danyelf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danyelf"/>
    <language>en</language>
    <item>
      <title>They Aren’t Pillars, They’re Lenses</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Tue, 22 Dec 2020 23:40:29 +0000</pubDate>
      <link>https://dev.to/honeycombio/they-aren-t-pillars-they-re-lenses-opf</link>
      <guid>https://dev.to/honeycombio/they-aren-t-pillars-they-re-lenses-opf</guid>
      <description>&lt;p&gt;To have Observability is to have the ability to understand your system’s internal state based on signals and externally-visible output. Honeycomb’s approach to Observability is to strive toward this: every feature of the product attempts to move closer to a unified vision of figuring out what your system did, and how it got there. Our approach is to let people smoothly move between aggregated views of their data, like heat-maps and line charts, into views that emphasize collections of events, like traces and BubbleUp, into views that emphasize single events, like raw data.&lt;/p&gt;

&lt;p&gt;In the broader marketplace, though, Observability is often promoted as “three pillars” — separating logging, monitoring, and tracing (aka logs, metrics &amp;amp; traces) into three distinct capabilities. We believe that separating these capabilities misses the true power of solving a problem with rich observability.&lt;/p&gt;

&lt;p&gt;The metaphor I like is to think of each feature as a lens on your data. Like a lens, each removes some wavelengths of information in exchange for emphasizing others. To debug in hi-res, you need to be able to see all the vivid colors.&lt;/p&gt;

&lt;p&gt;Let’s say, for example, that you’re tracking a service that seems to be acting up. An alert has gone off, saying that some users are having a poor experience. &lt;strong&gt;Monitoring tools&lt;/strong&gt; that track &lt;strong&gt;metrics&lt;/strong&gt; (the first pillar) will interpret data as a time series of numbers and gauges — and that’s really important, because it’s useful to know things such as how long a process takes to launch or how long a web page takes to load. Using a metrics monitoring tool (e.g. Prometheus) will help generate that alert. If the monitoring tool supports &lt;a href="https://www.vividcortex.com/blog/what-is-cardinality-in-monitoring" rel="noopener noreferrer"&gt;high cardinality&lt;/a&gt; — the ability to track hundreds or thousands of distinct values — you can even find out which endpoints those users are hitting and, perhaps, some information about which users they are.&lt;/p&gt;

&lt;p&gt;You could think of that as a magnifying glass with a blue lens on your data. It comes out looking something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6003" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second pillar is traces or &lt;a href="https://www.honeycomb.io/blog/get-deeper-insights-with-honeycomb-tracing/" rel="noopener noreferrer"&gt;tracing&lt;/a&gt;, which looks at individual calls and dives into how they are processed. From inside a tracing tool (e.g. Jaeger), you can do wonderful things — you can see which component took the longest or shortest, and you can see whether specific functions resulted in errors. In this case, for example, we might be able to use the information we found from the metrics to try to find a trace that hits the same endpoint. That trace might help us identify that the slow part of the trace was the call to the database, which is now taking much more time than before.&lt;/p&gt;

&lt;p&gt;(Of course, the process of getting from the metrics monitoring tool to the tracing tool is bumpy: the two types of tools collect different data. You need to find out how to correlate information in the metrics tool and the tracing tool. The process can be time-consuming and doesn’t always give you the accuracy you need. The fields might be called different things, and might use different encodings. Indeed, the key data might not be available in the two systems.)&lt;/p&gt;

&lt;p&gt;In our lens analogy, that’s a red lens. Through this lens, the picture looks pretty different — but there’s enough in common that we can tell we’re looking at the same image. Some parts stand out and are much more visible; other details entirely disappear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" class="article-body-image-wrapper"&gt;&lt;img class="wp-image-6004 " src="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But why did the database calls get slow? To continue debugging, you can look through logs, which are the third pillar. Scrolling around in the logs, you might find warnings issued by the database showing that it was overloaded at the time, or entries showing that the event queue had gotten long. That helps explain what happened to the database — but it’s a limited view. If we want to know how often this problem has arisen, we need to go back to the metrics to learn the history of the database queue.&lt;/p&gt;

&lt;p&gt;As before, the process of switching tools, this time from tracing to logging, requires a new set of searches, a new set of interactions, and, of course, more time.&lt;/p&gt;

&lt;p&gt;We could think of that as a green lens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6005" src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When companies sell the “three pillars of observability”, they lump all these visualizations together, but as &lt;em&gt;separate&lt;/em&gt; capabilities:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6003" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" alt="" width="220" height="265"&gt;&lt;/a&gt;    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6004" src="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6005" src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s not a bad start. Some things are completely invisible in one view but easy to see in others, so placing them side by side helps fill those gaps. Each image brings different aspects more clearly into view: the blue image shows the outline of the flowers best; the red shows the detail in the florets; and the green seems to capture the shading and depth best.&lt;/p&gt;

&lt;p&gt;But these three separate lenses have limitations. True observability is not just the ability to see each piece at a time; it’s also the ability to understand the whole and to see how the pieces combine to tell you the state of the underlying system.&lt;/p&gt;

&lt;p&gt;The truth is, of course, there aren’t three different systems interacting: there is one underlying system in all its richness. If we separate out these dimensions — if we collect metrics monitoring separately from logs and traces — then we lose the fact that this data reflects the single underlying system.&lt;/p&gt;

&lt;p&gt;We need to collect and preserve that richness and dimensionality. We need to move through the data smoothly, precisely, and efficiently. We need to be able to discover where a trace has a phenomenon that may be occurring over and over in other traces, and to &lt;a href="https://docs.honeycomb.io/working-with-your-data/tracing/explore-trace-data/#returning-to-the-query-builder" rel="noopener noreferrer"&gt;find out where and how often&lt;/a&gt;. We need to break down a monitoring chart into its &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/" rel="noopener noreferrer"&gt;underlying components&lt;/a&gt; to understand which factors really cause a spike.&lt;/p&gt;

&lt;p&gt;One way to implement this is to maintain a single set of telemetry collection and storage that keeps rich enough data that we can view it as metrics monitoring, tracing, or logging — or in some other perspective.&lt;/p&gt;

&lt;p&gt;Honeycomb’s event store acts as a single source of truth for everything that has happened in your system. Monitoring, tracing, and logging are simply different views of the system events being stored — and you can switch quickly and easily between those views. Tracing isn’t a separate experience of the event store: it’s a different lens that brings certain aspects into sharper focus. Any point on a heat-map or a metric line-chart connects to a trace, and any span on a trace can be turned into a query result.&lt;/p&gt;

&lt;p&gt;This single event store also enables Honeycomb to provide unique features such as BubbleUp. This is the ability to visually show a slice &lt;i&gt;across&lt;/i&gt; the data — in other words, how two sets of events differ from each other, across all their various dimensions (fields). That’s the sort of question that metrics systems simply cannot answer (because they don’t store the individual events), and, let’s face it, one that would be exhausting to answer in a dedicated log system.&lt;/p&gt;
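&lt;p&gt;BubbleUp’s internals aren’t public, but the core idea can be sketched in a few lines: given a selected set of events and a baseline set, rank every field by how differently its values are distributed in the two sets. (A toy sketch; the dict-shaped events and the distance measure here are my own assumptions, not Honeycomb’s implementation.)&lt;/p&gt;

```python
from collections import Counter

def bubbleup(selected, baseline, fields):
    """Rank fields by how differently their values are distributed
    in the selected events versus the baseline events."""
    scores = {}
    for field in fields:
        sel = Counter(e.get(field) for e in selected)
        base = Counter(e.get(field) for e in baseline)
        sel_n = sum(sel.values()) or 1
        base_n = sum(base.values()) or 1
        keys = set(sel) | set(base)
        # total variation distance between the two value distributions
        scores[field] = 0.5 * sum(
            abs(sel[k] / sel_n - base[k] / base_n) for k in keys
        )
    # fields that best separate the two sets come first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

&lt;p&gt;A field whose values split cleanly between the sets (say, 500s in the failing events and 200s in the baseline) ranks at the top; a field with identical distributions scores zero.&lt;/p&gt;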

&lt;p&gt;--&lt;br&gt;
What do you do when you have separate pieces of the complete picture? You need to manually connect the parts and make the connections, looking for correlates. In our lens analogy, that might be like seeing that an area shows as light colored in both the green and the red lens, so it must be yellow.&lt;/p&gt;

&lt;p&gt;You COULD do that math yourself. Flip back and forth. Stare at where bits contrast.&lt;/p&gt;

&lt;p&gt;Or, you could use a tool where seeing the image isn’t a matter of skill or of combining those pieces in your head: it’s all laid out, so you can see it as one complete, beautiful picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OWEBxB99--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/MON1478-1000x1000-1.jpg" class="article-body-image-wrapper"&gt;&lt;img class="wp-image-6006 alignleft" src="https://res.cloudinary.com/practicaldev/image/fetch/s--OWEBxB99--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/MON1478-1000x1000-1.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;i&gt;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;Claude Monet, “&lt;a href="https://www.metmuseum.org/art/collection/search/437112" rel="noopener noreferrer"&gt;Bouquet of Sunflowers&lt;/a&gt;,” 1881&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;Join the swarm. Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=they-arent-pillars-theyre-lenses"&gt;Honeycomb for free&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>observability</category>
      <category>logging</category>
      <category>monitoring</category>
      <category>tracing</category>
    </item>
    <item>
      <title>Challenges with Implementing SLOs</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Mon, 07 Dec 2020 20:44:21 +0000</pubDate>
      <link>https://dev.to/honeycombio/challenges-with-implementing-slos-1kp</link>
      <guid>https://dev.to/honeycombio/challenges-with-implementing-slos-1kp</guid>
      <description>&lt;p&gt;A few months ago, Honeycomb released our SLO — Service Level Objective — feature to the world. We’ve written before about &lt;a href="https://www.honeycomb.io/slo/" rel="noopener noreferrer"&gt;how to use it&lt;/a&gt; and some of the use scenarios. Today, I’d like to say a little more about how the feature has evolved, and what we did in the process of creating it. (Some of these notes are based on my talk, “Pitfalls in Measuring SLOs;” you can find the slides to that talk &lt;a href="https://www.honeycomb.io/talks/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, or view the video on our &lt;a href="https://www.honeycomb.io/talks/" rel="noopener noreferrer"&gt;Honeycomb Talks page)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Honeycomb approaches SLOs a little differently than some of the market does, and it’s interesting to step back and see how we made our decisions.&lt;/p&gt;

&lt;p&gt;If you aren’t familiar with our SLO feature, I’d encourage you to check out the &lt;a href="https://www.honeycomb.io/production-slos/" rel="noopener noreferrer"&gt;SLO webcast&lt;/a&gt; and our &lt;a href="https://www.honeycomb.io/slo/" rel="noopener noreferrer"&gt;other documentation&lt;/a&gt;. The shortest summary, though, is that an SLO is a way of expressing &lt;i&gt;how reliable&lt;/i&gt; a service is. An SLO comes in two parts: a metric (or indicator) that can measure the &lt;i&gt;quality&lt;/i&gt; of a service, and an expectation of &lt;i&gt;how often&lt;/i&gt; the service meets that metric.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://twitter.com/lizthegrey" rel="noopener noreferrer"&gt;Liz Fong-Jones&lt;/a&gt; joined Honeycomb, she came carrying the banner of SLOs. She’d had a lot of experience with them as an SRE at Google, and wanted us to support SLOs, too. Honeycomb had an interesting secret weapon, though: the fact that Honeycomb stores rich, wide events [REF] means that we can do things with SLOs that otherwise aren’t possible.&lt;/p&gt;

&lt;p&gt;The core concept of SLOs is outlined in the &lt;a href="https://landing.google.com/sre/books/" rel="noopener noreferrer"&gt;Google SRE book and workbook&lt;/a&gt;, and in Alex Hidalgo’s upcoming &lt;a href="http://shop.oreilly.com/product/0636920337867.do" rel="noopener noreferrer"&gt;Implementing Service Level Objectives&lt;/a&gt; book. In the process of implementing SLOs, though, we found that there were a number of issues that aren’t well-articulated in the Google texts; I’d like to spend a little time analyzing what we learned.&lt;/p&gt;

&lt;p&gt;I’ve put this post together because it might be fun to take a look behind the scenes — at what it takes to roll out this feature; at some of the dead ends and mistakes we made; and how we managed to spend $10,000 on AWS in one particularly embarrassing day.&lt;/p&gt;

&lt;p&gt;As background, I’m trained as a human-computer interaction researcher. That means I design against user needs, and build based on challenges that users are encountering. My toolbox includes a lot of prototyping, interviewing, and collecting early internal feedback. Fortunately, Honeycomb has a large and supportive user community — the “Pollinators” — who love to help each other, and give vocal and frequent feedback.&lt;/p&gt;

&lt;h3&gt;Expressing SLOs&lt;/h3&gt;

&lt;p&gt;In Honeycomb, you can express static &lt;a href="https://docs.honeycomb.io/working-with-your-data/triggers/" rel="noopener noreferrer"&gt;triggers&lt;/a&gt; pretty easily: simply set an aggregate operation (like COUNT or AVERAGE) and a filter. The whole experience re-uses our familiar query builder.&lt;/p&gt;

&lt;p&gt;We tried to go down the same path with SLOs. Unfortunately, it required an extra filter and — when we rolled it out with paper prototypes — more settings and screens than we really wanted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X6mYhQZV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mly8h2l60nd32xe8getp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X6mYhQZV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mly8h2l60nd32xe8getp.png" alt="for post"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We decided to reduce our goals, at least in the short term. Users would create SLOs far less often than they would view and monitor them; our effort should be spent on the monitoring experience. And because SLOs are aimed at enterprise customers, many of the users starting out with them would be advanced users, so we could make sure our customer success team was ready to help them create SLOs.&lt;/p&gt;

&lt;p&gt;In the end, we realized, an SLO was just three things: an &lt;a href="https://landing.google.com/sre/sre-book/chapters/service-level-objectives/" rel="noopener noreferrer"&gt;SLI (“Service Level Indicator”)&lt;/a&gt;, a time period, and a percentage. "Was the SLI fulfilled 99% of the time over the last 28 days?" An SLI, in turn, is a function that returns TRUE, FALSE, or N/A for every event. This turns out to be very easy to express in Honeycomb &lt;a href="https://docs.honeycomb.io/working-with-your-data/customizing-your-query/derived-columns/" rel="noopener noreferrer"&gt;Derived Columns&lt;/a&gt;. Indeed, it meant we could even create an &lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/cookbook/" rel="noopener noreferrer"&gt;SLI Cookbook&lt;/a&gt; that helped express some common patterns.&lt;/p&gt;
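&lt;p&gt;That framing is easy to sketch in a few lines of Python (the field names and the 300 ms threshold below are invented for illustration; in Honeycomb the SLI is expressed as a Derived Column, not Python):&lt;/p&gt;

```python
def sli(event):
    """Example SLI: returns TRUE, FALSE, or None (N/A) for every event."""
    if event.get("path") == "/healthz":
        return None  # out of scope: health checks count neither for nor against
    # TRUE when the request succeeded and was fast enough
    return event["status"] == 200 and not event["duration_ms"] > 300

def slo_compliance(events):
    """The SLO question: for what percentage of eligible events did the SLI hold?"""
    results = [sli(e) for e in events]
    eligible = [r for r in results if r is not None]
    if not eligible:
        return 100.0
    return 100.0 * sum(eligible) / len(eligible)
```

&lt;p&gt;The SLO then compares that percentage against the target (“was it at least 99% over the last 28 days?”).&lt;/p&gt;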

&lt;p&gt;We might revisit this decision at some point in the future — it would be nice to make the experience friendlier as we’re beginning to learn more about how users want to use them. But it was also useful to realize that we could allow that piece of the experience to be less well-designed.&lt;/p&gt;

&lt;h3&gt;Tracking SLOs&lt;/h3&gt;

&lt;p&gt;Our goal in putting together the main SLO display was to let users see where the burndown was happening, explain why it was happening, and remediate any problems they detected.&lt;/p&gt;

&lt;p&gt;This screenshot gives a sense of the SLO screen. At the top-left, the “remaining budget” shows how the current SLO budget has been burned down over 30 days. Note that the current budget has 46.7% remaining, and that it has been burning slowly and steadily.&lt;/p&gt;

&lt;p&gt;The top-right view shows our overall compliance: for each day of the last 30, what did the previous 30 look like? We’re gradually getting better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bWMKKWKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-12.png" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6166 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--bWMKKWKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-12.png" alt="" width="720" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contrast this view: we’ve burned almost three times our budget (-176% remaining means we burned the first 100%, and then another 176% on top of it).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vhhRJBcn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-11.png" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6165 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--vhhRJBcn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-11.png" alt="" width="721" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of that was due to a crash a few weeks ago — but, to be honest, our steady-state burn rate was probably higher than we wanted it to be. Indeed, we’ve &lt;i&gt;never&lt;/i&gt; met our goal of 99.5% at &lt;i&gt;any time&lt;/i&gt; in the last 30 days.&lt;/p&gt;

&lt;h3&gt;Explaining the Burn&lt;/h3&gt;

&lt;p&gt;The top parts of the screen are familiar and occur in many tools. The bottom part of the screen is, to me, more interesting, as it takes advantage of the unique aspects that Honeycomb has to offer. Honeycomb is all about &lt;b&gt;high-cardinality, high-dimensional&lt;/b&gt; data. We love it when users send us rich, complex events: it lets us give them tools that provide rich comparisons between events.&lt;/p&gt;

&lt;p&gt;The chart on this page shows a heatmap. Like other Honeycomb heatmaps, it shows the number of events with various durations (Y axis) over time (X axis). This time, though, it marks in yellow the events that failed the SLI. This image shows that the failed events are largely ones that are a little slower than we might expect; a few, though, are being processed quickly and still failing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pC-h8zIK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-5.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6158 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--pC-h8zIK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-5.png" alt="" width="616" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contrast this image, which shows that most of the failed events happened in a single burst. (The time axis on the bottom chart covers only one day, and is currently set to one of the big crash days.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QJXFM98N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-6.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6159 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--QJXFM98N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-6.png" alt="" width="613" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last, we can use Honeycomb’s &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/" rel="noopener noreferrer"&gt;BubbleUp&lt;/a&gt; capability to contrast events that succeeded with those that failed — across every dimension in the dataset! For example, in this chart, we see (in the top left) that failing events had status codes of 400 and 500, while succeeding events had status codes of 200.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lZjgOEIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-7.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6160 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--lZjgOEIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-7.png" alt="" width="627" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also see — zooming into the field &lt;code&gt;app.user.email&lt;/code&gt; in the second row — that this incident of burn is actually due to just one person encountering a small number of errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wwh24ZAG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-8.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6161" src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wwh24ZAG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-8.png" alt="" width="292" height="199"&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AZpM5dXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-9.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6162" src="https://res.cloudinary.com/practicaldev/image/fetch/s--AZpM5dXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-9.png" alt="" width="324" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This sort of rich explanation also lets us discover how to respond to the incident. For example, seeing this error, we know several things: we can reach out to the relevant customer to find out what the impact on them was; meanwhile, we can send the issue to the UX team to try to figure out what sequence of actions got to this error.&lt;/p&gt;

&lt;h2&gt;User Responses to SLOs&lt;/h2&gt;

&lt;p&gt;We got useful and enthusiastic responses to the early betas. One user who had set up their SLO system, for example, wrote: “The Bubble Up in the SLO page is really powerful at highlighting what is contributing the most to missing our SLIs, it has definitely confirmed our assumptions.”&lt;/p&gt;

&lt;p&gt;Another found SLOs were good for showing that their engineering effort was going in the right direction: “The historical SLO chart also confirms a fix for a performance issue we did that greatly contributed to the SLO compliance by showing a nice upward trend line. :)”&lt;/p&gt;

&lt;p&gt;Unfortunately, after that first burst of ebullience, enthusiasm began to wane a little. A third customer finally gave us the necessary insight: “I’d love to drive alerts off our SLOs. Right now &lt;b&gt;we don’t have anything to draw us in&lt;/b&gt; and have some alerts on the average error rate .... It would be great to get a better sense of when the budget is going down and define alerts that way.”&lt;/p&gt;

&lt;h2&gt;Designing an Alert System&lt;/h2&gt;

&lt;p&gt;We had hoped to put off alerts until after SLOs were finished. It had become clear to us, though — from our experience internally as well as from user feedback — that alerts were a fundamental part of the SLO experience. Fortunately, it wasn’t hard to &lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/#define-burn-alerts" rel="noopener noreferrer"&gt;design an alerting system&lt;/a&gt; that would warn you when your SLO was going to fail in several hours, or when your budget had burned out. We could extrapolate from the last few hours what the next few would look like; after some experimentation, we settled on a 1:4 ratio of baseline to prediction: that is, a one-hour baseline would be used to predict four hours out, and a six-hour baseline would be used to predict the next day.&lt;/p&gt;
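&lt;p&gt;The extrapolation itself is simple to sketch (a simplification with made-up units: budget expressed as a percentage, burn history bucketed by hour):&lt;/p&gt;

```python
def burn_alert(budget_remaining, hourly_burn, baseline_hours):
    """Fire when, extrapolating the recent burn rate forward, the budget
    would be exhausted within the prediction window. Uses the 1:4
    baseline-to-prediction ratio: a 1-hour baseline looks 4 hours ahead,
    a 6-hour baseline looks 24 hours ahead.
    hourly_burn: percent of budget consumed in each recent hour, newest last.
    """
    recent = hourly_burn[-baseline_hours:]
    rate = sum(recent) / len(recent)   # budget consumed per hour, on average
    lookahead = 4 * baseline_hours
    return rate * lookahead > budget_remaining
```

&lt;p&gt;So with 10% of budget left and a recent burn of 3% per hour, a one-hour baseline projects 12% burned over the next four hours, and the alert fires.&lt;/p&gt;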

&lt;p&gt;We built our first alert system, wired it up to check status every minute ... and promptly racked up a $10,000 day of AWS spend on data retrieval. (Before this, our most expensive query had cleared $0.25; this was a new and surprising cost.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--piQ1APsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-10.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6163 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--piQ1APsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-10.png" alt="" width="702" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Premature optimization is the root of all evil,” as Donald Knuth famously wrote (attributing the quip to Tony Hoare); it turns out that some optimization, however, can come too late. In our case, running a one-minute-resolution query across sixty days of data, every minute, was asking a lot of both our query system and our storage.&lt;/p&gt;

&lt;p&gt;We paused alerts and rapidly implemented a caching system.&lt;/p&gt;

&lt;p&gt;Caching is always an interesting challenge, but perhaps the most dramatic issue we ran into is that Honeycomb is designed as a best-effort querying system that is OK with occasional incorrect answers. (It’s even one of our company values: &lt;a href="https://www.honeycomb.io/blog/honeycomb-values-2018/" rel="noopener noreferrer"&gt;“Fast and close to right is better than perfect!”&lt;/a&gt;) Unfortunately, when you cache a close-to-right value, you keep an incorrect value in your cache for an extended period, and those occasional incorrect values had an outsize effect on SLO quality. Some investigation showed that our database was able to identify queries whose answers were approximations, and that a few retries would usually produce a correct value; we ended up simply ensuring that we didn’t cache approximations.&lt;/p&gt;
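&lt;p&gt;The resulting policy can be sketched like this (a toy version; the &lt;code&gt;approximate&lt;/code&gt; flag on results is my stand-in for however the engine marks an approximation):&lt;/p&gt;

```python
def cached_query(run_query, cache, key, max_retries=3):
    """Serve from cache when possible; otherwise run the query, retrying
    approximate answers a few times, and cache only exact results."""
    if key in cache:
        return cache[key]
    result = run_query(key)
    for _ in range(max_retries):
        if not result["approximate"]:
            break
        result = run_query(key)  # a few retries usually come back exact
    if not result["approximate"]:
        # never cache an approximation: a wrong value would otherwise
        # be served for the cache's whole lifetime
        cache[key] = result
    return result
```

&lt;p&gt;The key design choice is that an approximate answer can still be &lt;i&gt;returned&lt;/i&gt; (fast and close to right), it just never gets &lt;i&gt;stored&lt;/i&gt;.&lt;/p&gt;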

&lt;p&gt;(We had other challenges with caching and quirks of the database, but I think those are less relevant from a design perspective.)&lt;/p&gt;

&lt;h3&gt;Handling Flappy Alerts&lt;/h3&gt;

&lt;p&gt;One of our key design goals was to reduce the number of alerts produced by noisy systems. Within a few days of release, though, a customer started complaining that their alert had turned just as noisy.&lt;/p&gt;

&lt;p&gt;We realized they’d been unlucky: their system happened to be such that their burndown was being estimated at &lt;i&gt;just about&lt;/i&gt; four hours. A bad event would pop in — and the estimate would drop to 3:55. A good event would show up, and it would bump back up to 4:05. This flapping would turn the alerts on and off, frustrating and annoying users.&lt;/p&gt;

&lt;p&gt;Fortunately, the fix was easy once we’d figured out the problem: we added a small buffer, and the problems went away.&lt;/p&gt;
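&lt;p&gt;That “small buffer” is a classic hysteresis fix, which can be sketched as follows (the threshold and buffer values are illustrative, not Honeycomb’s actual numbers):&lt;/p&gt;

```python
def make_flap_guard(threshold_hours=4.0, buffer_hours=0.25):
    """Hysteresis around the alert threshold: fire when the projected
    time-to-exhaustion dips under the threshold, but only clear once it
    recovers past the threshold plus a buffer."""
    state = {"firing": False}

    def update(hours_left):
        if state["firing"]:
            if hours_left > threshold_hours + buffer_hours:
                state["firing"] = False   # recovered well clear of the line
        elif not hours_left > threshold_hours:
            state["firing"] = True        # crossed under the threshold
        return state["firing"]

    return update
```

&lt;p&gt;An estimate that hovers between 3:55 and 4:05 now fires once and stays firing, instead of toggling with every event.&lt;/p&gt;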

&lt;h2&gt;Learning from Experience&lt;/h2&gt;

&lt;p&gt;Last, I’d like to reflect just a little on what we’ve learned from the SLO experience, and some best practices for handling SLOs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Volume is important.&lt;/b&gt; A very small number of events really shouldn’t be enough to exhaust the budget: if two or three failures can do it, then most likely a standard alert is the right tool. The SLO should tolerate at least dozens of failed events a day. Working the math backward, an SLO of 99.9% needs a minimum of a few tens of thousands of events a day to be meaningful.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Test pathways, not users:&lt;/b&gt; It’s tempting to write an SLO per customer, to find out whether any customer is having a bad experience. That turns out to be a less productive path: first, it reduces volume (each customer now needs those tens of thousands of events); second, if a single customer is having a problem, does that imply something about the system, or about the customer? Writing SLOs on paths through the system and on user scenarios is a better way to identify commonalities.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Iterating is important:&lt;/b&gt; We learned rapidly that some of our internal SLOs were often off by a bit: they tested the wrong things, or had the wrong intuition for what it meant for something to be broken. For example, “status code &amp;gt;= 400” catches user errors (400s) as well as failures in our own system (500s). Iterating on them helped us figure out what we wanted.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Cultural change around SLOs can be slow.&lt;/b&gt; Alerts on numbers in the system are familiar; SLOs are new and may seem a little unpredictable. Internally, our teams had been slow to adopt SLOs; after an &lt;a href="https://www.honeycomb.io/blog/incident-report-running-dry-on-memory-without-noticing/" rel="noopener noreferrer"&gt;incident&lt;/a&gt; hit that the SLOs caught long before the alarms did, engineers started watching SLOs more carefully.&lt;/li&gt;
&lt;/ul&gt;
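&lt;p&gt;The volume point can be checked with a little arithmetic. If an SLO should tolerate, say, 30 failed events a day, then at a 99.9% target the failure budget is 0.1% of traffic, so the service needs roughly 30 / 0.001 = 30,000 events a day:&lt;/p&gt;

```python
# Minimum daily traffic needed for an SLO target to tolerate a given
# number of daily failures. The 30-failures figure is illustrative.
def min_daily_traffic(target, tolerated_failures_per_day):
    failure_budget_fraction = 1.0 - target
    return tolerated_failures_per_day / failure_budget_fraction

print(round(min_daily_traffic(0.999, 30)))  # 30000 events/day
```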

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;SLOs are an observability feature for characterizing what went wrong, how badly it went wrong, and how to prioritize repair. The pathway to implementing SLOs, however, was not as straightforward as we’d hoped. My hope in putting this post together is to help future implementers make decisions along their own paths — and to help users know a little more about what’s going on behind the scenes.&lt;/p&gt;

&lt;p&gt;Want to know more about what Honeycomb can do for your business? Check out &lt;a href="https://www.honeycomb.io/get-a-demo?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=challenges-with-implementing-slos/"&gt;our short demo&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>observability</category>
      <category>slo</category>
      <category>operations</category>
      <category>sre</category>
    </item>
    <item>
      <title>Honeycomb SLO Now Generally Available: Success, Defined.</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Fri, 04 Dec 2020 23:24:12 +0000</pubDate>
      <link>https://dev.to/honeycombio/honeycomb-slo-now-generally-available-success-defined-2hc3</link>
      <guid>https://dev.to/honeycombio/honeycomb-slo-now-generally-available-success-defined-2hc3</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed"&gt;Previously, in this series&lt;/a&gt;, we created a derived column to show how a back-end service was doing. That column categorized every incoming event as passing, failing, or irrelevant. We then counted up the column over time to see how many events passed and failed. But we had a problem: we were doing far too much math ourselves.&lt;/p&gt;

&lt;p&gt;To address that problem, Honeycomb has now released &lt;a href="https://honeycomb.io/slos"&gt;&lt;b&gt;SLO Support!&lt;/b&gt;&lt;/a&gt; Unsurprisingly, it is based on precisely the principles we discussed above.&lt;/p&gt;

&lt;p&gt;Recall that the derived column looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"batch"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;LT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which meant, “we only count requests that hit the batch endpoint, and use the POST method. If they do, then we will say the SLI has succeeded if we processed it in under 100 ms, and returned a 200; otherwise, we’ll call it a failure.” We counted the percentage of total requests as our SLI success rate. For example, we might say that over the last thirty days, we managed a 99.4% SLI success rate.&lt;/p&gt;

&lt;h2&gt;Formalizing this structure&lt;/h2&gt;

&lt;ul&gt;
    &lt;li&gt;We’ll pick &lt;i&gt;an SLI&lt;/i&gt;. An SLI (Service Level Indicator) consists of the ability to sort all the events in my dataset into three groups: those that are irrelevant, those that pass, and those that fail.&lt;/li&gt;
    &lt;li&gt;Now, we’ll pick a&lt;i&gt; target level&lt;/i&gt; for this SLI. “Of the relevant events, we want &lt;b&gt;99.95% of them to pass&lt;/b&gt;.”&lt;/li&gt;
    &lt;li&gt;Last, we’ll pick a &lt;i&gt;duration&lt;/i&gt; for them: “&lt;b&gt;Over each 30 days&lt;/b&gt;, we expect our SLI to be at 99.95% passing.”&lt;/li&gt;
&lt;/ul&gt;
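&lt;p&gt;In code, that three-part definition amounts to a classifier plus a ratio. A sketch, with field names following the batch-endpoint example above:&lt;/p&gt;

```python
# Classify each event as irrelevant (None), passing (True), or
# failing (False), then compute SLI compliance over the relevant ones.
def classify(event):
    relevant = (event.get("request.endpoint") == "batch"
                and event.get("request.method") == "POST")
    if not relevant:
        return None  # irrelevant events are excluded entirely
    return (event.get("response.status_code") == 200
            and event.get("duration_ms", float("inf")) < 100)

def sli_compliance(events):
    results = [r for r in (classify(e) for e in events) if r is not None]
    return sum(results) / len(results) if results else None
```

&lt;p&gt;The target level and duration then become a single question: over the chosen window, is that ratio at least 99.95%?&lt;/p&gt;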

&lt;p&gt;The nice thing about this is that we can &lt;i&gt;quantify how our SLI is doing&lt;/i&gt;. We can look at a dataset, and see what percentage of events have succeeded.&lt;/p&gt;

&lt;p&gt;This is a really useful way to think about systems that are constantly in minor states of error. Ordinary noise happens; this can lead to transient failures or occasional alerts. We can use this structure to ask how much these minor running errors are costing us.&lt;/p&gt;

&lt;p&gt;(When there’s a catastrophic failure, frankly, SLOs are less surprising: every light on every console is blinking red and the phone is buzzing. We’ll use SLOs in those cases to estimate “how bad was this incident.”)&lt;/p&gt;

&lt;h2&gt;Understanding your Error Budget&lt;/h2&gt;

&lt;p&gt;Let’s assume that we expect to see 100,000 relevant events in a given thirty-day period, and that 700 of them have failed over the last 27 days. Over the next three days, we can afford for another 300 events to fail and still maintain a 99% SLO.&lt;/p&gt;
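&lt;p&gt;The arithmetic generalizes: the budget is (1 - target) × expected events, and what remains is the budget minus the failures still inside the window. For instance, a 99% target over 100,000 events allows 1,000 failures; 700 spent leaves room for 300 more:&lt;/p&gt;

```python
# Rolling error budget: the window's total allowance, minus the
# failures that are still inside the window.
def error_budget(target, expected_events):
    return (1.0 - target) * expected_events

def budget_remaining(target, expected_events, failures_in_window):
    return error_budget(target, expected_events) - failures_in_window

print(round(budget_remaining(0.99, 100_000, 700)))  # 300 failures left
```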

&lt;p&gt;This gets to the concept of an &lt;b&gt;error budget. &lt;/b&gt;In Honeycomb’s implementation, error budgets are &lt;b&gt;continuously rolling&lt;/b&gt;: at any moment, old errors are slowly scrolling away into the past, no longer counting against your budget.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m1CrLNzQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO1-error-budget.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-5738 alignright" src="https://res.cloudinary.com/practicaldev/image/fetch/s--m1CrLNzQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO1-error-budget.png" alt="graph showing error budget lines" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our example, we'll assume that the world looks something like this: The grey line at the top is the total number of events a system has sent. It’s staying pretty constant. The orange line shows errors.&lt;/p&gt;

&lt;p&gt;For this chart, the Y scale on the errors is exaggerated: after all, if you’re running at 99.9%, that means that there’s 1/1000 the number of errors as successes. (The orange line would be very small!)&lt;/p&gt;

&lt;p&gt;33 days ago, there was an incident which caused the number of errors to spike. Fortunately, we got that under control pretty quickly. Two weeks ago, there was a slower-burning incident, which took a little longer to straighten out.&lt;/p&gt;

&lt;h2&gt;Checking the Burn Down graph&lt;/h2&gt;

&lt;p&gt;It would be great to track &lt;i&gt;when&lt;/i&gt; we spent our error budget. Was the painful part of our last month those big spikes? Or was it the small, continuous background burn the rest of the time? How much were those background events costing us?&lt;/p&gt;

&lt;p&gt;The &lt;b&gt;burn down graph&lt;/b&gt; shows the last month, and how much budget was burned every day. If we had looked at the graph last week, we’d have seen that our last 30 days had been burnt, pretty hard, by that first incident, and then again by the second. The rest of the time has been a slow, continuous burn: nothing too bad. That helps us make decisions: are we just barely making budget every month? Is the loss due to incidents, or is it because we are slowly burning away over time?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uUJ3eOyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO2-burn-down1.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5739" src="https://res.cloudinary.com/practicaldev/image/fetch/s--uUJ3eOyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO2-burn-down1.png" alt="graph showing downward trend of burn-down" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both of those can be totally fine! For some systems, it’s perfectly reasonable to have a slow, gentle burn of occasional errors. For others, we want to keep our powder dry to compensate for more-severe outages!&lt;/p&gt;

&lt;p&gt;The graph from six days ago was looking dismal. That first incident had burned 40% of the budget on its own; combined with the usual pace of “a few percent a day,” the budget was nearly exhausted.&lt;/p&gt;

&lt;p&gt;But if we look at the burn down graph today, things are looking better! The first incident is off the books, and now we're only making up for the errors of D2. Someday, that too will be forgotten.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZS9zk5J9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO3-burn-down2.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5740" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZS9zk5J9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO3-burn-down2.png" alt="graph showing burn-down trend today" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should also take a look at how we compare to the goal. For every day, we can compute the percentage of events that have passed the SLI. As you can see, we’re above 95% for most 30-day periods. At the trough of the first incident, things were pretty bad — and we lost ground, again, with the second one — but now we’re maintaining a comfortably higher level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wz9qCxDY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO4-overall-budget.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5741" src="https://res.cloudinary.com/practicaldev/image/fetch/s--wz9qCxDY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO4-overall-budget.png" alt="graph showing error budget overall" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, all these illustrations have shown moments when our problems were comfortably in the past. While that’s a great place to have our problems, we wouldn’t be using Honeycomb if all our problems were solved. That’s why there are two other important SLO aspects to think about:&lt;/p&gt;

&lt;h2&gt;SLO Burn Alerts&lt;/h2&gt;

&lt;p&gt;When the error rate is gradually increasing, it would be great to know when we'll run out of budget. Honeycomb creates Burn Alerts to show when our SLO will run out of budget. The green line shows the gradually shrinking budget, but on a slightly adjusted window.&lt;/p&gt;

&lt;p&gt;Then, Honeycomb predicts forward. The orange line looks at how our &lt;b&gt;last&lt;/b&gt; hour has been, and then interpolates forward to the &lt;b&gt;next&lt;/b&gt; four hours. In this image, the four hour estimate is going to dip below zero — and so the system warns the user.&lt;/p&gt;
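&lt;p&gt;The prediction can be as simple as a linear extrapolation: measure how much budget the last hour burned, project that rate over the lookahead window, and warn if the projection crosses zero. A sketch of the idea, not the production algorithm:&lt;/p&gt;

```python
# Linear burn-alert check: if the last hour's burn rate, carried
# forward, would exhaust the budget within the lookahead, warn.
def burn_alert(budget_now, budget_hour_ago, lookahead_hours=4.0):
    burn_per_hour = budget_hour_ago - budget_now
    projected = budget_now - burn_per_hour * lookahead_hours
    return projected < 0

print(burn_alert(budget_now=90.0, budget_hour_ago=120.0))  # True
```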

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K6h0JpOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO5-prediction.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5742" src="https://res.cloudinary.com/practicaldev/image/fetch/s--K6h0JpOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO5-prediction.png" alt="graph showing extrapolation of errore budget exhaustion" width="1940" height="1108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can let us know how long until we use up our error budget. It acts as a forewarning against slow failures.&lt;/p&gt;

&lt;p&gt;It’s really useful to have a couple of different time ranges. A 24 hour alert can mean “you’ve got a slow degradation in your service; you might want to fix it, but worry about it in the morning.” A four hour alert means “it’s time to get cracking.” (At Honeycomb, we tend to send 24 hour alerts to Slack channels, but 4 hour alerts to PagerDuty.)&lt;/p&gt;

&lt;h2&gt;&lt;b&gt;Find out why it's going wrong&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;This wouldn’t be Honeycomb if we didn’t provide you tools to dive into an issue. The SLO Page shows a &lt;a href="https://docs.honeycomb.io/working-with-your-data/heatmaps/"&gt;Heatmap&lt;/a&gt; and a &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/"&gt;BubbleUp&lt;/a&gt; of the last 24 hours, so you can figure out what’s changed and how you want to take it on.&lt;/p&gt;

&lt;p&gt;Here’s a great example: the SLO page for a Honeycomb tool that’s looking at rendering speed. (Yep, we’ve even set an SLO on end user experience!) This is a pretty loose SLO — really, we’re keeping it around to alarm us if our pages suddenly get really bad — but we can see that we’re doing OK against our goals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iU445LCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO6-overall.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5743" src="https://res.cloudinary.com/practicaldev/image/fetch/s--iU445LCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO6-overall.png" alt="screenshot of overall SLO view page" width="1146" height="1306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bottom half of the page shows &lt;i&gt;where&lt;/i&gt; the problems are coming from. The BubbleUp heatmap shows the last day of events: events higher up are yellow, meaning they fail the SLI; events lower down are blue, meaning they comply with it. We can see that mostly this is happening when events are particularly slow.&lt;/p&gt;

&lt;p&gt;We can also look in there and see that it’s &lt;b&gt;one particular page&lt;/b&gt; that seems to be having the worst experience, and one particular user email that’s running slow. That’s a pretty cool insight — it tells us where to look, and how we might want to handle it. It also gives us a sense of what repro cases to look for, so we can figure out what strange thing this user is doing.&lt;/p&gt;

&lt;h2&gt;Now, define your own SLOs&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/"&gt;Honeycomb SLOs&lt;/a&gt; are now released and are available to Enterprise/yearly contract customers. We’d love to learn more about how you think about SLOs, and what you use them for.&lt;/p&gt;




&lt;p&gt;Read the final installment in this blog series: &lt;a href="https://dev.to/honeycombio/challenges-with-implementing-slos-1kp"&gt;Challenges with Implementing SLOs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New to Honeycomb? &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=honeycomb-slo-now-generally-available-success-defined"&gt;Get started for free&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>slo</category>
      <category>observability</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Working Toward Service Level Objectives (SLOs), Part 1</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Fri, 04 Dec 2020 23:19:16 +0000</pubDate>
      <link>https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed</link>
      <guid>https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed</guid>
      <description>&lt;p&gt;In theory, Honeycomb is always up. Our servers run without hiccups, our user interface loads rapidly and is highly responsive, and our query engine is lightning fast. In practice, this isn’t always perfectly the case — and dedicated readers of this blog have learned about how we &lt;a href="https://www.honeycomb.io/search/incident+review"&gt;use those experiences to improve the product&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, we could spend all our time on system stability. We could polish the front-end code and look for inefficiencies; throw ever-harder test-cases at the back-end. (There are a few developers who are vocal advocates for that approach!) But we also want to make the product better — and so we keep rolling out so many &lt;a href="https://changelog.honeycomb.io/"&gt;great features&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;How do we decide when to work on improving stability, and when we get to go make fun little tweaks?&lt;/p&gt;

&lt;h2&gt;From organizational objectives to events&lt;/h2&gt;

&lt;p&gt;So let’s say we come to an agreement with each other about "how bad" things are. When things are bad—when we’re feeling mired in small errors, or one big error that takes down the service—we can slow down on Cool Feature Development, and switch over to stability work. Conversely, when things are feeling reasonably stable, we can believe we have a pretty solid infrastructure for development, and can slow down on repair and maintenance work.&lt;/p&gt;

&lt;p&gt;What would that agreement look like?&lt;/p&gt;

&lt;p&gt;First, it means being able to take a good hard look at our system. Honeycomb has a &lt;a href="https://www.honeycomb.io/blog/toward-a-maturity-model-for-observability/"&gt;mature level of observability&lt;/a&gt;, so we feel pretty confident that we have the raw tools to look at how we’re doing — where users are experiencing challenges, and where bugs are appearing in our system.&lt;/p&gt;

&lt;p&gt;Second, it means coming to understand that no system is perfect. If our goal is 100% uptime at all times for all requests, then we’ll be disappointed, because some things will fail from time to time.&lt;em&gt; But we can come up with statements about quality of service&lt;/em&gt;. Honeycomb had an internal meeting where we worked to quantify this:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;We pretty much &lt;b&gt;never&lt;/b&gt; want to lose customer data. We live and die by storing customer data, so we want &lt;i&gt;every&lt;/i&gt; batch of customer telemetry to get back a positive response, and quickly. Let’s say that we want them to be handled in under 100 ms, without errors, for 99.95% of requests. (That means that in a full year, we could have about &lt;b&gt;4.4 hours&lt;/b&gt; of downtime.)&lt;/li&gt;
    &lt;li&gt;We want our main service to be up pretty much every time someone clicks on honeycomb.io, and we want it to load pretty quickly. Let’s say we want to load the page without errors, within a second, for 99.9% of requests.&lt;/li&gt;
    &lt;li&gt;Sometimes, when you run a query, it takes a little longer. For that, we decided that 99.5% of data queries should come back within 10 seconds and not return an error.&lt;/li&gt;
&lt;/ul&gt;
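&lt;p&gt;The downtime figures fall straight out of the percentages: a year has 365 × 24 = 8,760 hours, so an allowance of 0.05% works out to a bit under four and a half hours:&lt;/p&gt;

```python
# Convert an availability target into allowed downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours_per_year(target):
    return (1.0 - target) * HOURS_PER_YEAR

print(round(downtime_hours_per_year(0.9995), 2))  # about 4.38 hours
```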

&lt;p&gt;These are entirely reasonable goals. The wonderful thing is, they can actually be expressed in Honeycomb’s dogfood servers as &lt;a href="https://www.honeycomb.io/blog/level-up-with-derived-columns-two-neat-tricks-that-will-improve-your-observability/"&gt;Derived Column expressions&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;For example, we would write that first one — the one about data ingest — as “We’re talking about events where &lt;code&gt;request.endpoint&lt;/code&gt; uses the &lt;code&gt;batch&lt;/code&gt; endpoint and the input is a &lt;code&gt;POST&lt;/code&gt; request. When they do, they should return a code 200, and the &lt;code&gt;duration_ms&lt;/code&gt; should be under 100.”&lt;/p&gt;

&lt;p&gt;Let’s call this a “Service Level Indicator,” because we use it to indicate how our service is doing. In our &lt;a href="https://docs.honeycomb.io/working-with-your-data/customizing-your-query/derived-columns/reference/"&gt;derived column language&lt;/a&gt;, that looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"batch"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;LT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We name that derived column “SLI”, and we can generate a COUNT and a HEATMAP on it.&lt;/p&gt;

&lt;p&gt;This looks pretty sane: we see that there are many more points that are true (the indicator is ok! everything is great!) than false (oh, no, they failed!); and we can see that all the points that are slowest are in the “false” group.&lt;/p&gt;

&lt;p&gt;Let’s whip out our Trusty Pocket Calculator: 35K events with “false”; 171 million with true. That’s about a 0.02% failure rate — we’re up at 99.98%. Sounds like we’re doing ok!&lt;/p&gt;
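&lt;p&gt;(No pocket calculator required, of course; the same arithmetic is a couple of lines of code:)&lt;/p&gt;

```python
# Failure rate from the COUNT results: "false" events failed the SLI,
# "true" events passed it.
failed, passed = 35_000, 171_000_000
failure_rate = failed / (failed + passed)
print(f"{failure_rate:.2%} failing, {1 - failure_rate:.2%} passing")
# 0.02% failing, 99.98% passing
```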

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ToJjW0OX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-1.gif" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5232" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ToJjW0OX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-1.gif" alt="animated gif of a graph with some failures" width="1112" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But there are still some failures. I’d love to know why!&lt;/p&gt;

&lt;p&gt;By clicking over to the BubbleUp tab, I can find out who is having this slow experience. I highlight all the slowest requests, and BubbleUp shows me a histogram for every dimension in the dataset. By finding the columns that differ most from everything else, I can see where these errors stand out.&lt;/p&gt;

&lt;p&gt;... and I see that it’s one particular customer, and one particular team. Not only that, but they’re using a fairly unusual API for Honeycomb (that’s the fourth entry, &lt;code&gt;request.header.user-agent&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9GsO5xFS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-2.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5233" src="https://res.cloudinary.com/practicaldev/image/fetch/s--9GsO5xFS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-2.png" alt="screenshot of using BubbleUp to identify one problematic user" width="985" height="988"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is great, and highly actionable! I can reach out to the customer, and find out what’s up; I can send our integrations team to go look at that particular package, and see if we’re doing something that’s making it hard to use well.&lt;/p&gt;

&lt;h2&gt;Quantifying quality of service means you can measure it&lt;/h2&gt;

&lt;p&gt;So, bringing that back to where we started: we began with organizational goals and found a way to quantify an abstract concept. “Always up and fast” now has a meaning, and is measurable. We can then use that to diagnose what’s going wrong, and figure out how to make it faster.&lt;/p&gt;

&lt;p&gt;Part 2, coming soon: Wait, why did I have to pull out my pocket calculator? Don’t we have computers for that? Also, this term “SLI”, it feels familiar somehow...&lt;/p&gt;




&lt;p&gt;Read the next post in the series:&lt;br&gt;
&lt;a href="https://www.honeycomb.io/blog/honeycomb-slo-now-generally-available-success-defined/"&gt;Honeycomb SLO Now Generally Available: Success, Defined.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Excited about what the future of operational excellence looks like? Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=working-toward-service-level-objectives-slos-part-1"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>slos</category>
      <category>observability</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>RubyGems surge during Cloudflare outage: an investigation</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Sat, 18 Jul 2020 00:27:04 +0000</pubDate>
      <link>https://dev.to/honeycombio/rubygems-surge-during-cloudflare-outage-an-investigation-2i31</link>
      <guid>https://dev.to/honeycombio/rubygems-surge-during-cloudflare-outage-an-investigation-2i31</guid>
      <description>&lt;p&gt;Today at Honeycomb we saw a huge drop in incoming data volume from several customers. It turned out to be a result of the &lt;a href="https://www.cloudflarestatus.com/incidents/b888fyhbygb8"&gt;Cloudflare outage&lt;/a&gt;.&lt;/p&gt;


&lt;blockquote class="ltag__twitter-tweet"&gt;
      &lt;div class="ltag__twitter-tweet__media"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XcrOb1a5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKRIK_XgAMHuEd.png" alt="unknown tweet media content"&gt;
      &lt;/div&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--VhzL9fCz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/910366463663407105/o6zFbfsP_normal.jpg" alt="honeycomb profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        honeycomb
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        @honeycombio
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P4t6ys1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      Here's what the downstream effects of a &lt;a href="https://twitter.com/Cloudflare"&gt;@Cloudflare&lt;/a&gt; outage looked like to our customer success team: some of our customers processed less traffic (and thus sent us less telemetry), while one had a traffic surge!&lt;br&gt;&lt;br&gt;Our measurements show the impact lasted 25 minutes and is now over. 
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      22:09 - 17 Jul 2020
    &lt;/div&gt;


    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1284248833233489920" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-reply-action.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1284248833233489920" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-retweet-action.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      15
      &lt;a href="https://twitter.com/intent/like?tweet_id=1284248833233489920" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-like-action.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
      39
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;


&lt;p&gt;I did this investigation as a &lt;a href="https://twitter.com/FisherDanyel/status/1284252031624949760"&gt;live thread in the replies&lt;/a&gt; to the above Honeycomb tweet, but I'm copying it here as well. Follow along with us!&lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__434528"&gt;
  
  
    &lt;a href="/danyelf" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WdoYpAhT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/practicaldev/image/fetch/s--2Qlfin4u--/c_fill%2Cf_auto%2Cfl_progressive%2Ch_150%2Cq_auto%2Cw_150/https://dev-to-uploads.s3.amazonaws.com/uploads/user/profile_image/434528/ebaf5f43-6e8d-4dd7-941a-9323a74fbd18.png" alt="danyelf image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/danyelf"&gt;Danyel Fisher&lt;/a&gt;
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/danyelf"&gt;/danyelf&lt;/a&gt;
    &lt;/div&gt;
    &lt;p class="ltag__user__social"&gt;
        &lt;a href="https://github.com/danyelf" rel="noopener"&gt;
          &lt;img class="icon-img" alt="github logo" src="https://res.cloudinary.com/practicaldev/image/fetch/s--C74Jn3f1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/github-logo.svg"&gt;danyelf
        &lt;/a&gt;
    &lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Do you notice that one spike that went &lt;em&gt;upwards&lt;/em&gt;? Someone had increased traffic during the outage.&lt;/p&gt;

&lt;p&gt;By astounding coincidence, it just happens to be &lt;a href="https://rubygems.org/"&gt;RubyGems&lt;/a&gt;, which Honeycomb -- working with &lt;a href="https://twitter.com/rubytogether"&gt;@rubytogether&lt;/a&gt; -- offers as a &lt;em&gt;free public dataset&lt;/em&gt; for anyone to play with.&lt;/p&gt;

&lt;p&gt;Which means that you can, without signing up, &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/byDD5JdUuew"&gt;go explore live data from RubyGems&lt;/a&gt;! And since Honeycomb permalinks work forever, all the links here will open query results in the Honeycomb UI even after the original outage has aged out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GGLSvFZU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKVE3iUwAALOaw%3Fformat%3Dpng%26name%3D900x900" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GGLSvFZU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKVE3iUwAALOaw%3Fformat%3Dpng%26name%3D900x900" alt="line graph showing the number of events over time, with an increase at 14:15 that continues for about 20 minutes before going back to normal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're looking at the &lt;a href="https://www.fastly.com/"&gt;Fastly&lt;/a&gt; logs from the RubyGems site. Each event represents a request about a gem -- maybe listing available gems, or downloading one.&lt;/p&gt;
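&lt;p&gt;To make the raw material concrete: each event is just a bag of key/value fields. Here's a sketch of one such event as a Ruby hash -- only &lt;code&gt;response_body_size&lt;/code&gt; and &lt;code&gt;request_user_agent&lt;/code&gt; are field names taken from this dataset; the rest are illustrative.&lt;/p&gt;

```ruby
# A sketch of one Fastly request event as a Ruby hash. Only
# response_body_size and request_user_agent appear in this post;
# the other field names and all the values are illustrative.
event = {
  "timestamp"           => "2020-07-17T14:17:03Z",
  "gem"                 => "wkhtmltopdf-binary",  # invented value
  "response_body_size"  => 15_000_000,            # bytes
  "request_user_agent"  => "Ruby, RubyGems/2.6.14 x86_64-linux",
  "status"              => 200
}

puts event["response_body_size"]
```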

&lt;h2&gt;Exploring by Size&lt;/h2&gt;

&lt;p&gt;Let's add in a &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/6ewxqcgWrEx"&gt;heatmap&lt;/a&gt; of the &lt;code&gt;response_body_size&lt;/code&gt; -- in other words, how much were people downloading?  Looks like during that time, we saw &lt;em&gt;fewer&lt;/em&gt; requests for big downloads.&lt;/p&gt;
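&lt;p&gt;If you'd rather build the same query programmatically, Honeycomb also has a Query API that takes a JSON spec. A hedged sketch of the heatmap query as a Ruby hash -- the shape follows Honeycomb's Query Data API, but double-check the docs before relying on it:&lt;/p&gt;

```ruby
require "json"

# A sketch of the heatmap query as a Honeycomb-style query spec.
# Treat this as an illustration, not a verified request body.
query = {
  "calculations" => [
    { "op" => "COUNT" },
    { "op" => "HEATMAP", "column" => "response_body_size" }
  ],
  "time_range" => 7200  # seconds of history to query
}

puts JSON.generate(query)
```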

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ACynQccD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKV2gOUwAAVMvV%3Fformat%3Dpng%26name%3D900x900" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ACynQccD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKV2gOUwAAVMvV%3Fformat%3Dpng%26name%3D900x900" alt="A chart showing the count of downloads, as well as a heatmap of response_body_size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's grab a few samples showing how traffic during the outage differs from before and after it. I click to "BubbleUp" -- again, you can do this too, with the same clicks I did! -- and select that top area. This asks "how are the data points in the yellow area different from the green ones below?"&lt;/p&gt;
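&lt;p&gt;Conceptually, BubbleUp's comparison is simple: for each field, compare how often each value appears in the selected events versus the baseline, and surface the values that are over-represented. A toy sketch with invented data:&lt;/p&gt;

```ruby
# Toy version of BubbleUp's comparison for one field: measure how much
# more common each value is in the selection than in the baseline.
def value_shares(events)
  total = events.size.to_f
  events.tally.transform_values { |n| n / total }
end

# Invented user-agent values for illustration.
baseline  = ["chrome", "chrome", "ruby-agent", "curl"]
selection = ["ruby-agent", "ruby-agent", "ruby-agent", "chrome"]

base_shares = value_shares(baseline)
sel_shares  = value_shares(selection)

# Positive difference means the value is over-represented in the selection.
diffs = sel_shares.map { |v, share| [v, share - base_shares.fetch(v, 0.0)] }.to_h

puts diffs.max_by { |_, d| d }.first  # the most over-represented value
```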

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XDx4wQ33--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKWhfLU0AMr5Xh%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XDx4wQ33--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKWhfLU0AMr5Xh%3Fformat%3Dpng%26name%3Dmedium" alt="Using BubbleUp on the Heatmap view, selecting the events with the highest response body size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we see the top few dimensions where the requests highlighted in yellow look different. That huge file was &lt;a href="https://wkhtmltopdf.org/"&gt;wkhtmltopdf&lt;/a&gt;. I'd love to know why it suddenly fell off the radar. Is there something about big packages that made them more likely to fail?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tXZomov0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKWzqIU0AAhK62%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tXZomov0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKWzqIU0AAhK62%3Fformat%3Dpng%26name%3Dmedium" alt="The gem name for the largest responses is wk html to pdf"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You know what? &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/vPgytdjWvSs"&gt;Let's check that hypothesis.&lt;/a&gt; Let's list all gems by their size (descending). Now we can scroll through some of the various gems and see if they also had interruptions.&lt;/p&gt;
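&lt;p&gt;That "list by size" step boils down to a group-and-aggregate. A toy sketch of the same operation in plain Ruby, with invented events:&lt;/p&gt;

```ruby
# Sketch of "list all gems by their size (descending)": group events by
# gem name, take the largest response seen for each, sort descending.
# Gem names and sizes here are invented.
events = [
  { gem: "wkhtmltopdf-binary", size: 15_000_000 },
  { gem: "rails",              size: 350_000 },
  { gem: "wkhtmltopdf-binary", size: 14_800_000 },
  { gem: "rake",               size: 80_000 }
]

by_size = events.group_by { |e| e[:gem] }
                .transform_values { |es| es.map { |e| e[:size] }.max }
                .sort_by { |_, size| -size }

by_size.each { |name, size| puts "#{name}: #{size}" }
```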

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr0Hx65O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://p-81fa8j.b1.n0.cdn.getcloudapp.com/items/BluO87dy/2020-07-17_15-41-07%2520%25281%2529.gif%3Fv%3D6877c6385bdafa71bbb58441a64e8497" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr0Hx65O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://p-81fa8j.b1.n0.cdn.getcloudapp.com/items/BluO87dy/2020-07-17_15-41-07%2520%25281%2529.gif%3Fv%3D6877c6385bdafa71bbb58441a64e8497" alt="gif: scrolling over each of the gem names to see the shape of their graphs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'll leave it to someone smarter than myself to explain why &lt;code&gt;wkhtmltopdf-binary&lt;/code&gt; had a dropoff, but &lt;code&gt;wkhtmltopdf-binary-edge&lt;/code&gt; saw consistent traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DGuW95Se--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKZvCdUwAAEoyd%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DGuW95Se--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKZvCdUwAAEoyd%3Fformat%3Dpng%26name%3Dmedium" alt="graph showing the count of various gem names, with wk html to pdf binary and wk html to binary edge visible"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Feel free to try this later!&lt;/h2&gt;

&lt;p&gt;As a side note: every one of these URLs will persist indefinitely. Honeycomb keeps query results forever, so it's always easy to share a link.&lt;/p&gt;

&lt;p&gt;If you find something, a permalink is definitely the best way to share it back! (On Slack, they even get pretty unfurls.)&lt;/p&gt;

&lt;h2&gt;Examining what went up&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/htzomsGaS9b"&gt;Back to the main heatmap.&lt;/a&gt; When I've got some outliers, a log transformation can help make the image much easier to read. So here we've got the response body size -- and the log scaled version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FCH3sRq7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKbO_bUMAECGDl%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FCH3sRq7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKbO_bUMAECGDl%3Fformat%3Dpng%26name%3Dmedium" alt="Graphs of the event count as well as the response body size, with and without using a log scale"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That dark blue spot really stands out to me. Why are the requests in that dense area so different from everything else? &lt;/p&gt;

&lt;p&gt;Let's try another BubbleUp: we'll grab the interesting area, select it, and BubbleUp shows how the fields in that selection differ from the baseline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MtC0Vd7Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKbjQvU0AApVee%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MtC0Vd7Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKbjQvU0AApVee%3Fformat%3Dpng%26name%3Dmedium" alt="Heatmap view with the graph's dark spot highlighted and some BubbleUp results visible"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most fields aren't very different. But then there's that one on the left: &lt;code&gt;request_user_agent&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A quick hover tells us that's &lt;code&gt;Ruby, RubyGems/2.6.14 x86_64-linux Ruby/2.4.10 (2020-03-31 patchlevel 364)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CsPPCxhR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKcCuAUwAAK0d1%3Fformat%3Dpng%26name%3Dsmall" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CsPPCxhR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKcCuAUwAAK0d1%3Fformat%3Dpng%26name%3Dsmall" alt="Hovering over the outlier request user agent to reveal the value"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Um. WTF? My first hypothesis would be that it's a coincidence: maybe that version had just been released, so these events are disproportionately from it. (Correlation does not equal causation!)&lt;/p&gt;

&lt;p&gt;So let's break down by user agent. There are a lot of them! They have different temporal patterns -- including a few that may share a pattern similar to the one we care about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O75boR5Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKdPm5UEAAHgJA%3Fformat%3Dpng%26name%3Dlarge" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O75boR5Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKdPm5UEAAHgJA%3Fformat%3Dpng%26name%3Dlarge" alt="Query results grouped by user agent"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But still: this one seems special. When Cloudflare went down, it decided to spike like mad. &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/taMVHNF2LzN"&gt;Here it is in Honeycomb.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K96-t5AE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKeJ86UwAALOs2%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K96-t5AE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKeJ86UwAALOs2%3Fformat%3Dpng%26name%3Dmedium" alt="Query results for the above outlier user agent, with a large spike in the event count graph during the Cloudflare outage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Was it just this one crazy user? Good thing we can &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/bVCGA8YgLFJ"&gt;break down by user IP address&lt;/a&gt; (hashed and sanitized for your protection)! Nope.&lt;/p&gt;
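&lt;p&gt;For the curious: sanitizing IPs for a public dataset can be as simple as a salted one-way hash, so the same client still groups together without the address being readable. A sketch -- the actual RubyGems pipeline may well differ:&lt;/p&gt;

```ruby
require "digest"

# One way to sanitize client IPs for a public dataset: hash them so the
# same client still groups together, but the address is not recoverable.
# The salt prevents trivially re-hashing the whole IPv4 space.
SALT = "example-salt"  # illustrative; a real pipeline would keep this secret

def sanitize_ip(ip)
  Digest::SHA256.hexdigest(SALT + ip)[0, 12]
end

puts sanitize_ip("203.0.113.7")
# The same input always maps to the same token; different inputs diverge.
```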

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QQzT-6qe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKeR2KUEAE9qJY%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QQzT-6qe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKeR2KUEAE9qJY%3Fformat%3Dpng%26name%3Dmedium" alt="Query results grouped by sanitized IP addresses"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What's next?&lt;/h2&gt;

&lt;p&gt;There are a lot of places we could go with this investigation. We could investigate what people downloaded, or what other user agents seem to have been affected, or whether caching policies made a difference.&lt;/p&gt;

&lt;p&gt;We could look at the time it took to serve requests.&lt;/p&gt;

&lt;p&gt;But what's more fun than me looking at this stuff? It's you.&lt;/p&gt;

&lt;p&gt;Go click the links. Play with things a little bit. See if you can find something, and let me know what you found.&lt;/p&gt;

&lt;p&gt;Let's learn, together, from the public RubyGems data!&lt;/p&gt;

</description>
      <category>o11y</category>
      <category>observability</category>
      <category>honeycomb</category>
    </item>
  </channel>
</rss>
