<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Danyel Fisher</title>
    <description>The latest articles on DEV Community by Danyel Fisher (@danyelf).</description>
    <link>https://dev.to/danyelf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F434528%2Febaf5f43-6e8d-4dd7-941a-9323a74fbd18.png</url>
      <title>DEV Community: Danyel Fisher</title>
      <link>https://dev.to/danyelf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danyelf"/>
    <language>en</language>
    <item>
      <title>They Aren’t Pillars, They’re Lenses</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Tue, 22 Dec 2020 23:40:29 +0000</pubDate>
      <link>https://dev.to/honeycombio/they-aren-t-pillars-they-re-lenses-opf</link>
      <guid>https://dev.to/honeycombio/they-aren-t-pillars-they-re-lenses-opf</guid>
      <description>&lt;p&gt;To have Observability is to have the ability to understand your system’s internal state based on signals and externally-visible output. Honeycomb’s approach to Observability is to strive toward this: every feature of the product attempts to move closer to a unified vision of figuring out what your system did, and how it got there. Our approach is to let people smoothly move between aggregated views of their data, like heat-maps and line charts, into views that emphasize collections of events, like traces and BubbleUp, into views that emphasize single events, like raw data.&lt;/p&gt;

&lt;p&gt;In the broader marketplace, though, Observability is often promoted as “three pillars” — separating logging, monitoring, and tracing (aka logs, metrics &amp;amp; traces) into three distinct capabilities. We believe that separating these capabilities misses the true power of solving a problem with rich observability.&lt;/p&gt;

&lt;p&gt;The metaphor I like is to think of each feature as a lens on your data. Like a lens, each removes some wavelengths of information in exchange for emphasizing others. To debug in hi-res, you need to be able to see all the vivid colors.&lt;/p&gt;

&lt;p&gt;Let’s say, for example, that you’re tracking a service that seems to be acting up. An alert has gone off, saying that some users are having a poor experience. &lt;strong&gt;Monitoring tools&lt;/strong&gt; that track &lt;strong&gt;metrics&lt;/strong&gt; (the first pillar) will interpret data as a time series of numbers and gauges — and that’s really important, because it’s useful to know things such as how long a process takes to launch or how long a web page takes to load. Using a metrics monitoring tool (e.g. Prometheus) will help generate that alert. If the monitoring tool supports &lt;a href="https://www.vividcortex.com/blog/what-is-cardinality-in-monitoring" rel="noopener noreferrer"&gt;high cardinality&lt;/a&gt; — the ability to track hundreds or thousands of distinct values — you can even find out which endpoints those users are hitting and, perhaps, some information about which users they are.&lt;/p&gt;

&lt;p&gt;You could think of that as a magnifying glass with a blue lens on your data. It comes out looking something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6003" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second pillar is traces or &lt;a href="https://www.honeycomb.io/blog/get-deeper-insights-with-honeycomb-tracing/" rel="noopener noreferrer"&gt;tracing&lt;/a&gt;, which looks at individual calls and dives into how they are processed. From inside a tracing tool (e.g. Jaeger), you can do wonderful things — you can see which component took the longest or shortest, and you can see whether specific functions resulted in errors. In this case, for example, we might be able to use the information we found from the metrics to try to find a trace that hits the same endpoint. That trace might help us identify that the slow part of the trace was the call to the database, which is now taking much more time than before.&lt;/p&gt;

&lt;p&gt;(Of course, the process of getting from the metrics monitoring tool to the tracing tool is bumpy: the two types of tools collect different data. You need to find out how to correlate information in the metrics tool and the tracing tool. The process can be time-consuming and doesn’t always give you the accuracy you need. The fields might be called different things, and might use different encodings. Indeed, the key data might not be available in the two systems.)&lt;/p&gt;

&lt;p&gt;In our lens analogy, that’s a red lens. Through this lens, the picture looks pretty different — but there’s enough in common that we can tell we’re looking at the same image. Some parts stand out and are much more visible; other details entirely disappear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" class="article-body-image-wrapper"&gt;&lt;img class="wp-image-6004 " src="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But why did the database calls get slow? To continue debugging, you can look through logs, which are the third pillar. Scrolling around in the logs, you might find warnings issued by the database showing that it was overloaded at the time, or entries showing that the event queue had gotten long. That helps explain what happened to the database — but it’s a limited view. If we want to know how often this problem has arisen, we need to go back to the metrics to learn the history of the database queue.&lt;/p&gt;

&lt;p&gt;As before, the process of switching tools, this time from tracing to logging, requires a new set of searches, a new set of interactions, and, of course, more time.&lt;/p&gt;

&lt;p&gt;We could think of that as a green lens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6005" src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When companies sell the “three pillars of observability”, they lump all these visualizations together, but as &lt;em&gt;separate&lt;/em&gt; capabilities:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6003" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" alt="" width="220" height="265"&gt;&lt;/a&gt;    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6004" src="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6005" src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s not a bad start. Some things are completely invisible in one view but easy to see in others, so placing them side by side helps fill those gaps. Each image brings different aspects more clearly into view: the blue image shows the outline of the flowers best; the red shows the detail in the florets; and the green seems to capture the shading and depth best.&lt;/p&gt;

&lt;p&gt;But these three separate lenses have limitations. True observability is not just the ability to see each piece at a time; it’s also the ability to understand the whole and to see how the pieces combine to tell you the state of the underlying system.&lt;/p&gt;

&lt;p&gt;The truth is, of course, there aren’t three different systems interacting: there is one underlying system in all its richness. If we separate out these dimensions — if we collect metrics monitoring separately from logs and traces — then we lose the fact that this data reflects the single underlying system.&lt;/p&gt;

&lt;p&gt;We need to collect and preserve that richness and dimensionality. We need to move through the data smoothly, precisely, and efficiently. We need to be able to discover where a trace has a phenomenon that may be occurring over and over in other traces, and to &lt;a href="https://docs.honeycomb.io/working-with-your-data/tracing/explore-trace-data/#returning-to-the-query-builder" rel="noopener noreferrer"&gt;find out where and how often&lt;/a&gt;. We need to break down a monitoring chart into its &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/" rel="noopener noreferrer"&gt;underlying components&lt;/a&gt; to understand which factors really cause a spike.&lt;/p&gt;

&lt;p&gt;One way to implement this is to maintain a single set of telemetry collection and storage that keeps rich enough data that we can view it as metrics monitoring, tracing, or logging — or in some other perspective.&lt;/p&gt;

&lt;p&gt;Honeycomb’s event store acts as a single source of truth for everything that has happened in your system. Monitoring, tracing, and logging are simply different views of the system events being stored — and you can switch quickly and easily between those views. Tracing isn’t a separate experience of the event store: it’s a different lens that brings certain aspects into sharper focus. Any point on a heat-map or a metric line-chart connects to a trace, and any span on a trace can be turned into a query result.&lt;/p&gt;

&lt;p&gt;This single event store also enables Honeycomb to provide unique features such as BubbleUp. This is the ability to visually show a slice &lt;i&gt;across&lt;/i&gt; the data — in other words, how two sets of events differ from each other, across all their various dimensions (fields). That’s the sort of question that metrics systems simply cannot answer (because they don’t store the individual events), and, let’s face it, one that would be exhausting to answer in a dedicated log system.&lt;/p&gt;
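&lt;p&gt;BubbleUp’s internals aren’t public, but the core idea can be sketched in a few lines: given a selected set of events and a baseline set, rank every field by how differently its values are distributed in the two sets. (A toy sketch; the dict-shaped events and the distance measure here are my own assumptions, not Honeycomb’s implementation.)&lt;/p&gt;

```python
from collections import Counter

def bubbleup(selected, baseline, fields):
    """Rank fields by how differently their values are distributed
    in the selected events versus the baseline events."""
    scores = {}
    for field in fields:
        sel = Counter(e.get(field) for e in selected)
        base = Counter(e.get(field) for e in baseline)
        sel_n = sum(sel.values()) or 1
        base_n = sum(base.values()) or 1
        keys = set(sel) | set(base)
        # total variation distance between the two value distributions
        scores[field] = 0.5 * sum(
            abs(sel[k] / sel_n - base[k] / base_n) for k in keys
        )
    # fields that best separate the two sets come first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

&lt;p&gt;A field whose values split cleanly between the sets (say, 500s in the failing events and 200s in the baseline) ranks at the top; a field with identical distributions scores zero.&lt;/p&gt;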

&lt;p&gt;--&lt;br&gt;
What do you do when you have separate pieces of the complete picture? You need to manually connect the parts and make the connections, looking for correlates. In our lens analogy, that might be like seeing that an area shows as light colored in both the green and the red lens, so it must be yellow.&lt;/p&gt;

&lt;p&gt;You COULD do that math yourself. Flip back and forth. Stare at where bits contrast.&lt;/p&gt;

&lt;p&gt;Or, you could use a tool where seeing the image isn’t a matter of skill or of combining those pieces in your head: it’s all laid out, so you can see it as one complete, beautiful picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OWEBxB99--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/MON1478-1000x1000-1.jpg" class="article-body-image-wrapper"&gt;&lt;img class="wp-image-6006 alignleft" src="https://res.cloudinary.com/practicaldev/image/fetch/s--OWEBxB99--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/MON1478-1000x1000-1.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;i&gt;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;Claude Monet, “&lt;a href="https://www.metmuseum.org/art/collection/search/437112" rel="noopener noreferrer"&gt;Bouquet of Sunflowers&lt;/a&gt;,” 1881&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;Join the swarm. Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=they-arent-pillars-theyre-lenses"&gt;Honeycomb for free&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>observability</category>
      <category>logging</category>
      <category>monitoring</category>
      <category>tracing</category>
    </item>
    <item>
      <title>Challenges with Implementing SLOs</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Mon, 07 Dec 2020 20:44:21 +0000</pubDate>
      <link>https://dev.to/honeycombio/challenges-with-implementing-slos-1kp</link>
      <guid>https://dev.to/honeycombio/challenges-with-implementing-slos-1kp</guid>
      <description>&lt;p&gt;A few months ago, Honeycomb released our SLO — Service Level Objective — feature to the world. We’ve written before about &lt;a href="https://www.honeycomb.io/slo/" rel="noopener noreferrer"&gt;how to use it&lt;/a&gt; and some of the use scenarios. Today, I’d like to say a little more about how the feature has evolved, and what we did in the process of creating it. (Some of these notes are based on my talk, “Pitfalls in Measuring SLOs;” you can find the slides to that talk &lt;a href="https://www.honeycomb.io/talks/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, or view the video on our &lt;a href="https://www.honeycomb.io/talks/" rel="noopener noreferrer"&gt;Honeycomb Talks page)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Honeycomb approaches SLOs a little differently than some of the market does, and it’s interesting to step back and see how we made our decisions.&lt;/p&gt;

&lt;p&gt;If you aren’t familiar with our SLO feature, I’d encourage you to check out the &lt;a href="https://www.honeycomb.io/production-slos/" rel="noopener noreferrer"&gt;SLO webcast&lt;/a&gt; and our &lt;a href="https://www.honeycomb.io/slo/" rel="noopener noreferrer"&gt;other documentation&lt;/a&gt;. The shortest summary, though, is that an SLO is a way of expressing &lt;i&gt;how reliable&lt;/i&gt; a service is. An SLO comes in two parts: a metric (or indicator) that can measure the &lt;i&gt;quality&lt;/i&gt; of a service, and an expectation of &lt;i&gt;how often&lt;/i&gt; the service meets that metric.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://twitter.com/lizthegrey" rel="noopener noreferrer"&gt;Liz Fong-Jones&lt;/a&gt; joined Honeycomb, she came carrying the banner of SLOs. She’d had a lot of experience with them as an SRE at Google, and wanted us to support SLOs, too. Honeycomb had an interesting secret weapon, though: the fact that Honeycomb stores rich, wide events [REF] means that we can do things with SLOs that otherwise aren’t possible.&lt;/p&gt;

&lt;p&gt;The core concept of SLOs is outlined in the &lt;a href="https://landing.google.com/sre/books/" rel="noopener noreferrer"&gt;Google SRE book and workbook&lt;/a&gt;, and in Alex Hidalgo’s upcoming &lt;a href="http://shop.oreilly.com/product/0636920337867.do" rel="noopener noreferrer"&gt;Implementing Service Level Objectives&lt;/a&gt; book. In the process of implementing SLOs, though, we found that there were a number of issues that aren’t well-articulated in the Google texts; I’d like to spend a little time analyzing what we learned.&lt;/p&gt;

&lt;p&gt;I’ve put this post together because it might be fun to take a look behind the scenes — at what it takes to roll out this feature; at some of the dead ends and mistakes we made; and how we managed to spend $10,000 on AWS in one particularly embarrassing day.&lt;/p&gt;

&lt;p&gt;As background, I’m trained as a human-computer interaction researcher. That means I design against user needs, and build based on challenges that users are encountering. My toolbox includes a lot of prototyping, interviewing, and collecting early internal feedback. Fortunately, Honeycomb has a large and supportive user community — the “Pollinators” — who love to help each other, and give vocal and frequent feedback.&lt;/p&gt;

&lt;h3&gt;Expressing SLOs&lt;/h3&gt;

&lt;p&gt;In Honeycomb, you can express static &lt;a href="https://docs.honeycomb.io/working-with-your-data/triggers/" rel="noopener noreferrer"&gt;triggers&lt;/a&gt; pretty easily: simply set an aggregate operation (like COUNT or AVERAGE) and a filter. The whole experience re-uses our familiar query builder.&lt;/p&gt;

&lt;p&gt;We tried to go down the same path with SLOs. Unfortunately, it required an extra filter and — when we rolled it out with paper prototypes — more settings and screens than we really wanted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X6mYhQZV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mly8h2l60nd32xe8getp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X6mYhQZV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mly8h2l60nd32xe8getp.png" alt="for post"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We decided to reduce our goals, at least in the short term. Users would create SLOs far less often than they would view and monitor them; our effort should be spent on the monitoring experience. And because SLOs are aimed at enterprise customers, many of the users starting out with them would be advanced users, so we could make sure our customer success team was ready to help them create SLOs.&lt;/p&gt;

&lt;p&gt;In the end, we realized, an SLO was just three things: an &lt;a href="https://landing.google.com/sre/sre-book/chapters/service-level-objectives/" rel="noopener noreferrer"&gt;SLI (“Service Level Indicator”)&lt;/a&gt;, a time period, and a percentage. "Was the SLI fulfilled 99% of the time over the last 28 days?" An SLI, in turn, is a function that returns TRUE, FALSE, or N/A for every event. This turns out to be very easy to express in Honeycomb &lt;a href="https://docs.honeycomb.io/working-with-your-data/customizing-your-query/derived-columns/" rel="noopener noreferrer"&gt;Derived Columns&lt;/a&gt;. Indeed, it meant we could even create an &lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/cookbook/" rel="noopener noreferrer"&gt;SLI Cookbook&lt;/a&gt; that helped express some common patterns.&lt;/p&gt;
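&lt;p&gt;That framing is easy to sketch in a few lines of Python (the field names and the 300 ms threshold below are invented for illustration; in Honeycomb the SLI is expressed as a Derived Column, not Python):&lt;/p&gt;

```python
def sli(event):
    """Example SLI: returns TRUE, FALSE, or None (N/A) for every event."""
    if event.get("path") == "/healthz":
        return None  # out of scope: health checks count neither for nor against
    # TRUE when the request succeeded and was fast enough
    return event["status"] == 200 and not event["duration_ms"] > 300

def slo_compliance(events):
    """The SLO question: for what percentage of eligible events did the SLI hold?"""
    results = [sli(e) for e in events]
    eligible = [r for r in results if r is not None]
    if not eligible:
        return 100.0
    return 100.0 * sum(eligible) / len(eligible)
```

&lt;p&gt;The SLO then compares that percentage against the target (“was it at least 99% over the last 28 days?”).&lt;/p&gt;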

&lt;p&gt;We might revisit this decision at some point in the future — it would be nice to make the experience friendlier as we’re beginning to learn more about how users want to use them. But it was also useful to realize that we could allow that piece of the experience to be less well-designed.&lt;/p&gt;

&lt;h3&gt;Tracking SLOs&lt;/h3&gt;

&lt;p&gt;Our goal in putting together the main SLO display was to let users see where the burndown was happening, explain why it was happening, and remediate any problems they detected.&lt;/p&gt;

&lt;p&gt;This screenshot gives a sense of the SLO screen. At the top-left, the “remaining budget” shows how the current SLO budget has been burned down over 30 days. Note that the current budget has 46.7% remaining, and that it has been burning slowly and steadily.&lt;/p&gt;

&lt;p&gt;The top-right view shows our overall compliance: for each day of the last 30, what did the previous 30 look like? We’re gradually getting better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bWMKKWKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-12.png" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6166 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--bWMKKWKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-12.png" alt="" width="720" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contrast this view: we’ve burned almost three times our budget (-176% remaining means we burned the first 100%, and then another 176% on top of it).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vhhRJBcn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-11.png" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6165 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--vhhRJBcn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-11.png" alt="" width="721" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of that was due to a crash a few weeks ago — but, to be honest, our steady-state burn rate was probably higher than we wanted it to be. Indeed, we’ve &lt;i&gt;never&lt;/i&gt; met our goal of 99.5% at &lt;i&gt;any time&lt;/i&gt; in the last 30 days.&lt;/p&gt;

&lt;h3&gt;Explaining the Burn&lt;/h3&gt;

&lt;p&gt;The top parts of the screen are familiar and occur in many tools. The bottom part of the screen is, to me, more interesting, as it takes advantage of the unique aspects that Honeycomb has to offer. Honeycomb is all about &lt;b&gt;high-cardinality, high-dimensional&lt;/b&gt; data. We love it when users send us rich, complex events: it lets us give them tools that provide rich comparisons between events.&lt;/p&gt;

&lt;p&gt;The chart on this page shows a heatmap. Like other Honeycomb heatmaps, it shows the number of events with various durations (Y axis) over time (X axis). This time, though, it marks in yellow the events that failed the SLI. This image shows that the failed events are largely ones that are a little slower than we might expect; a few, though, are being processed quickly and still failing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pC-h8zIK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-5.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6158 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--pC-h8zIK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-5.png" alt="" width="616" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contrast this image, which shows that most of the failed events happened in a single burst. (The time axis on the bottom chart covers only one day, and is currently set to one of the big crash days.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QJXFM98N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-6.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6159 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--QJXFM98N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-6.png" alt="" width="613" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last, we can use Honeycomb’s &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/" rel="noopener noreferrer"&gt;BubbleUp&lt;/a&gt; capability to contrast events that succeeded with those that failed — across every dimension in the dataset! For example, in this chart, we see (in the top left) that failing events had status codes of 400 and 500, while succeeding events had status codes of 200.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lZjgOEIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-7.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6160 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--lZjgOEIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-7.png" alt="" width="627" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also see — zooming into the field &lt;code&gt;app.user.email&lt;/code&gt; in the second row — that this incident of burn is actually due to just one person encountering a small number of errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wwh24ZAG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-8.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6161" src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wwh24ZAG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-8.png" alt="" width="292" height="199"&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AZpM5dXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-9.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6162" src="https://res.cloudinary.com/practicaldev/image/fetch/s--AZpM5dXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-9.png" alt="" width="324" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This sort of rich explanation also lets us discover how to respond to the incident. For example, seeing this error, we know several things: we can reach out to the relevant customer to find out what the impact on them was; meanwhile, we can send the issue to the UX team to try to figure out what sequence of actions got to this error.&lt;/p&gt;

&lt;h2&gt;User Responses to SLOs&lt;/h2&gt;

&lt;p&gt;We got useful and enthusiastic responses to the early betas. One user who had set up their SLO system, for example, wrote: “The Bubble Up in the SLO page is really powerful at highlighting what is contributing the most to missing our SLIs, it has definitely confirmed our assumptions.”&lt;/p&gt;

&lt;p&gt;Another found SLOs were good for showing that their engineering effort was going in the right direction: “The historical SLO chart also confirms a fix for a performance issue we did that greatly contributed to the SLO compliance by showing a nice upward trend line. :)”&lt;/p&gt;

&lt;p&gt;Unfortunately, after that first burst of ebullience, enthusiasm began to wane a little. A third customer finally gave us the necessary insight: “I’d love to drive alerts off our SLOs. Right now &lt;b&gt;we don’t have anything to draw us in&lt;/b&gt; and have some alerts on the average error rate .... It would be great to get a better sense of when the budget is going down and define alerts that way.”&lt;/p&gt;

&lt;h2&gt;Designing an Alert System&lt;/h2&gt;

&lt;p&gt;We had hoped to put off alerts until after SLOs were finished. It had become clear to us, though — from our experience internally as well as from user feedback — that alerts were a fundamental part of the SLO experience. Fortunately, it wasn’t hard to &lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/#define-burn-alerts" rel="noopener noreferrer"&gt;design an alerting system&lt;/a&gt; that would warn you when your SLO was going to fail in several hours, or when your budget had burned out. We could extrapolate from the last few hours what the next few would look like; after some experimentation, we settled on a 1:4 ratio of baseline to prediction: that is, a one-hour baseline would be used to predict four hours out, and a six-hour baseline would be used to predict the next day.&lt;/p&gt;
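&lt;p&gt;The extrapolation itself is simple to sketch (a simplification with made-up units: budget expressed as a percentage, burn history bucketed by hour):&lt;/p&gt;

```python
def burn_alert(budget_remaining, hourly_burn, baseline_hours):
    """Fire when, extrapolating the recent burn rate forward, the budget
    would be exhausted within the prediction window. Uses the 1:4
    baseline-to-prediction ratio: a 1-hour baseline looks 4 hours ahead,
    a 6-hour baseline looks 24 hours ahead.
    hourly_burn: percent of budget consumed in each recent hour, newest last.
    """
    recent = hourly_burn[-baseline_hours:]
    rate = sum(recent) / len(recent)   # budget consumed per hour, on average
    lookahead = 4 * baseline_hours
    return rate * lookahead > budget_remaining
```

&lt;p&gt;So with 10% of budget left and a recent burn of 3% per hour, a one-hour baseline projects 12% burned over the next four hours, and the alert fires.&lt;/p&gt;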

&lt;p&gt;We built our first alert system, wired it up to check status every minute ... and promptly racked up a $10,000 day of AWS spend on data retrieval. (Before this, our most expensive query had cleared $0.25; this was a new and surprising cost.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--piQ1APsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-10.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6163 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--piQ1APsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-10.png" alt="" width="702" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Premature optimization is the root of all evil,” as Donald Knuth famously wrote (attributing the quip to Tony Hoare); it turns out that some optimization, however, can come too late. In our case, running a one-minute-resolution query across sixty days of data, every minute, was asking a lot of both our query system and our storage.&lt;/p&gt;

&lt;p&gt;We paused alerts and rapidly implemented a caching system.&lt;/p&gt;

&lt;p&gt;Caching is always an interesting challenge, but perhaps the most dramatic issue we ran into is that Honeycomb is designed as a best-effort querying system that is OK with occasional incorrect answers. (It’s even one of our company values: &lt;a href="https://www.honeycomb.io/blog/honeycomb-values-2018/" rel="noopener noreferrer"&gt;“Fast and close to right is better than perfect!”&lt;/a&gt;) Unfortunately, when you cache a close-to-right value, you keep an incorrect value in your cache for an extended period, and those occasional incorrect values had an outsize effect on SLO quality. Some investigation showed that our database was able to identify queries whose answers were approximations, and that a few retries would usually produce a correct value; we ended up simply ensuring that we didn’t cache approximations.&lt;/p&gt;
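&lt;p&gt;The resulting policy can be sketched like this (a toy version; the &lt;code&gt;approximate&lt;/code&gt; flag on results is my stand-in for however the engine marks an approximation):&lt;/p&gt;

```python
def cached_query(run_query, cache, key, max_retries=3):
    """Serve from cache when possible; otherwise run the query, retrying
    approximate answers a few times, and cache only exact results."""
    if key in cache:
        return cache[key]
    result = run_query(key)
    for _ in range(max_retries):
        if not result["approximate"]:
            break
        result = run_query(key)  # a few retries usually come back exact
    if not result["approximate"]:
        # never cache an approximation: a wrong value would otherwise
        # be served for the cache's whole lifetime
        cache[key] = result
    return result
```

&lt;p&gt;The key design choice is that an approximate answer can still be &lt;i&gt;returned&lt;/i&gt; (fast and close to right), it just never gets &lt;i&gt;stored&lt;/i&gt;.&lt;/p&gt;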

&lt;p&gt;(We had other challenges with caching and quirks of the database, but I think those are less relevant from a design perspective.)&lt;/p&gt;

&lt;h3&gt;Handling Flappy Alerts&lt;/h3&gt;

&lt;p&gt;One of our key design goals was to reduce the number of alerts produced by noisy systems. Within a few days of release, though, a customer started complaining that their alert had turned just as noisy.&lt;/p&gt;

&lt;p&gt;We realized they’d been unlucky: their system happened to be such that their burndown was being estimated at &lt;i&gt;just about&lt;/i&gt; four hours. A bad event would pop in — and the estimate would drop to 3:55. A good event would show up, and it would bump back up to 4:05. This flapping would turn the alerts on and off, frustrating and annoying users.&lt;/p&gt;

&lt;p&gt;Fortunately, the fix was easy once we’d figured out the problem: we added a small buffer, and the problems went away.&lt;/p&gt;
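&lt;p&gt;That “small buffer” is a classic hysteresis fix, which can be sketched as follows (the threshold and buffer values are illustrative, not Honeycomb’s actual numbers):&lt;/p&gt;

```python
def make_flap_guard(threshold_hours=4.0, buffer_hours=0.25):
    """Hysteresis around the alert threshold: fire when the projected
    time-to-exhaustion dips under the threshold, but only clear once it
    recovers past the threshold plus a buffer."""
    state = {"firing": False}

    def update(hours_left):
        if state["firing"]:
            if hours_left > threshold_hours + buffer_hours:
                state["firing"] = False   # recovered well clear of the line
        elif not hours_left > threshold_hours:
            state["firing"] = True        # crossed under the threshold
        return state["firing"]

    return update
```

&lt;p&gt;An estimate that hovers between 3:55 and 4:05 now fires once and stays firing, instead of toggling with every event.&lt;/p&gt;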

&lt;h2&gt;Learning from Experience&lt;/h2&gt;

&lt;p&gt;Last, I’d like to reflect just a little on what we’ve learned from the SLO experience, and some best practices for handling SLOs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Volume is important.&lt;/b&gt; A very small number of events really shouldn’t be enough to exhaust the budget: if two or three failures can do it, then most likely a standard alert is the right tool. The SLO should tolerate at least dozens of failed events a day. Working the math backward, an SLO of 99.9% needs a minimum of a few tens of thousands of events a day to be meaningful.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Test pathways, not users:&lt;/b&gt; It’s tempting to write an SLO per customer, to find out whether any customer is having a bad experience. That turns out to be a less productive path: first, it reduces volume (each customer now needs those tens of thousands of events); second, if a single customer is having a problem, does that imply something about the system, or about the customer? Writing SLOs on paths through the system and on user scenarios is a better way to identify commonalities.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Iterating is important:&lt;/b&gt; We learned rapidly that some of our internal SLOs were often off by a bit: they tested the wrong things, or had the wrong intuition for what it meant for something to be broken. For example, “status code &amp;gt;= 400” catches user errors (400s) as well as failures in our own system (500s). Iterating on them helped us figure out what we wanted.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Cultural change around SLOs can be slow.&lt;/b&gt; Alerts on numbers in the system are familiar; SLOs are new and may seem a little unpredictable. Internally, our teams had been slow to adopt SLOs; after an &lt;a href="https://www.honeycomb.io/blog/incident-report-running-dry-on-memory-without-noticing/" rel="noopener noreferrer"&gt;incident&lt;/a&gt; hit that the SLOs caught long before the alarms did, engineers started watching SLOs more carefully.&lt;/li&gt;
&lt;/ul&gt;
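&lt;p&gt;The volume point can be checked with a little arithmetic. If an SLO should tolerate, say, 30 failed events a day, then at a 99.9% target the failure budget is 0.1% of traffic, so the service needs roughly 30 / 0.001 = 30,000 events a day:&lt;/p&gt;

```python
# Minimum daily traffic needed for an SLO target to tolerate a given
# number of daily failures. The 30-failures figure is illustrative.
def min_daily_traffic(target, tolerated_failures_per_day):
    failure_budget_fraction = 1.0 - target
    return tolerated_failures_per_day / failure_budget_fraction

print(round(min_daily_traffic(0.999, 30)))  # 30000 events/day
```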

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;SLOs are an observability feature for characterizing what went wrong, how badly it went wrong, and how to prioritize repair. The pathway to implementing SLOs, however, was not as straightforward as we’d hoped. My hope in putting this post together is to help future implementers make decisions along their own paths — and to help users know a little more about what’s going on behind the scenes.&lt;/p&gt;

&lt;p&gt;Want to know more about what Honeycomb can do for your business? Check out &lt;a href="https://www.honeycomb.io/get-a-demo?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=challenges-with-implementing-slos/"&gt;our short demo&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>observability</category>
      <category>slo</category>
      <category>operations</category>
      <category>sre</category>
    </item>
    <item>
      <title>Honeycomb SLO Now Generally Available: Success, Defined.</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Fri, 04 Dec 2020 23:24:12 +0000</pubDate>
      <link>https://dev.to/honeycombio/honeycomb-slo-now-generally-available-success-defined-2hc3</link>
      <guid>https://dev.to/honeycombio/honeycomb-slo-now-generally-available-success-defined-2hc3</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed"&gt;Previously, in this series&lt;/a&gt;, we created a derived column to show how a back-end service was doing. That column categorized every incoming event as passing, failing, or irrelevant. We then counted up the column over time to see how many events passed and failed. But we had a problem: we were doing far too much math ourselves.&lt;/p&gt;

&lt;p&gt;To address that problem, Honeycomb has now released &lt;a href="https://honeycomb.io/slos"&gt;&lt;b&gt;SLO Support!&lt;/b&gt;&lt;/a&gt; Unsurprisingly, it is based on precisely the principles we discussed above.&lt;/p&gt;

&lt;p&gt;Recall that the derived column looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"batch"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;LT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which meant, “we only count requests that hit the batch endpoint, and use the POST method. If they do, then we will say the SLI has succeeded if we processed it in under 100 ms, and returned a 200; otherwise, we’ll call it a failure.” We counted the percentage of total requests as our SLI success rate. For example, we might say that over the last thirty days, we managed a 99.4% SLI success rate.&lt;/p&gt;

&lt;h2&gt;Formalizing this structure&lt;/h2&gt;

&lt;ul&gt;
    &lt;li&gt;We’ll pick &lt;i&gt;an SLI&lt;/i&gt;. An SLI (Service Level Indicator) consists of the ability to sort all the events in my dataset into three groups: those that are irrelevant, those that pass, and those that fail.&lt;/li&gt;
    &lt;li&gt;Now, we’ll pick a&lt;i&gt; target level&lt;/i&gt; for this SLI. “Of the relevant events, we want &lt;b&gt;99.95% of them to pass&lt;/b&gt;.”&lt;/li&gt;
    &lt;li&gt;Last, we’ll pick a &lt;i&gt;duration&lt;/i&gt; for them: “&lt;b&gt;Over each 30 days&lt;/b&gt;, we expect our SLI to be at 99.95% passing.”&lt;/li&gt;
&lt;/ul&gt;
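&lt;p&gt;In code, that three-part definition amounts to a classifier plus a ratio. A sketch, with field names following the batch-endpoint example above:&lt;/p&gt;

```python
# Classify each event as irrelevant (None), passing (True), or
# failing (False), then compute SLI compliance over the relevant ones.
def classify(event):
    relevant = (event.get("request.endpoint") == "batch"
                and event.get("request.method") == "POST")
    if not relevant:
        return None  # irrelevant events are excluded entirely
    return (event.get("response.status_code") == 200
            and event.get("duration_ms", float("inf")) < 100)

def sli_compliance(events):
    results = [r for r in (classify(e) for e in events) if r is not None]
    return sum(results) / len(results) if results else None
```

&lt;p&gt;The target level and duration then become a single question: over the chosen window, is that ratio at least 99.95%?&lt;/p&gt;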

&lt;p&gt;The nice thing about this is that we can &lt;i&gt;quantify how our SLI is doing&lt;/i&gt;. We can look at a dataset, and see what percentage of events have succeeded.&lt;/p&gt;

&lt;p&gt;This is a really useful way to think about systems that are constantly in minor states of error. Ordinary noise happens; this can lead to transient failures or occasional alerts. We can use this structure to ask how much these minor running errors are costing us.&lt;/p&gt;

&lt;p&gt;(When there’s a catastrophic failure, frankly, SLOs are less surprising: every light on every console is blinking red and the phone is buzzing. We’ll use SLOs in those cases to estimate “how bad was this incident.”)&lt;/p&gt;

&lt;h2&gt;Understanding your Error Budget&lt;/h2&gt;

&lt;p&gt;Let’s assume that we expect to see 100,000 relevant events in a given thirty-day period, and that 700 of them have failed over the last 27 days. Over the next three days, we can afford for another 300 events to fail and still maintain a 99% SLO.&lt;/p&gt;
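&lt;p&gt;The arithmetic generalizes: the budget is (1 - target) × expected events, and what remains is the budget minus the failures still inside the window. For instance, a 99% target over 100,000 events allows 1,000 failures; 700 spent leaves room for 300 more:&lt;/p&gt;

```python
# Rolling error budget: the window's total allowance, minus the
# failures that are still inside the window.
def error_budget(target, expected_events):
    return (1.0 - target) * expected_events

def budget_remaining(target, expected_events, failures_in_window):
    return error_budget(target, expected_events) - failures_in_window

print(round(budget_remaining(0.99, 100_000, 700)))  # 300 failures left
```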

&lt;p&gt;This gets to the concept of an &lt;b&gt;error budget. &lt;/b&gt;In Honeycomb’s implementation, error budgets are &lt;b&gt;continuously rolling&lt;/b&gt;: at any moment, old errors are slowly scrolling away into the past, no longer counting against your budget.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m1CrLNzQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO1-error-budget.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-5738 alignright" src="https://res.cloudinary.com/practicaldev/image/fetch/s--m1CrLNzQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO1-error-budget.png" alt="graph showing error budget lines" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our example, we'll assume that the world looks something like this: The grey line at the top is the total number of events a system has sent. It’s staying pretty constant. The orange line shows errors.&lt;/p&gt;

&lt;p&gt;For this chart, the Y scale on the errors is exaggerated: after all, if you’re running at 99.9%, that means that there’s 1/1000 the number of errors as successes. (The orange line would be very small!)&lt;/p&gt;

&lt;p&gt;33 days ago, there was an incident which caused the number of errors to spike. Fortunately, we got that under control pretty quickly. Two weeks ago, there was a slower-burning incident, which took a little longer to straighten out.&lt;/p&gt;

&lt;h2&gt;Checking the Burn Down graph&lt;/h2&gt;

&lt;p&gt;It would be great to track &lt;i&gt;when&lt;/i&gt; we spent our error budget. Was the painful part of our last month those big spikes? Or was it the small, continuous background burn the rest of the time? How much were those background events costing us?&lt;/p&gt;

&lt;p&gt;The &lt;b&gt;burn down graph&lt;/b&gt; shows the last month, and how much budget was burned every day. If we had looked at the graph last week, we’d have seen that our last 30 days had been burnt, pretty hard, by that first incident, and then again by the second. The rest of the time has been a slow, continuous burn: nothing too bad. That helps us make decisions: are we just barely making budget every month? Is the loss due to incidents, or is it because we are slowly burning away over time?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uUJ3eOyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO2-burn-down1.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5739" src="https://res.cloudinary.com/practicaldev/image/fetch/s--uUJ3eOyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO2-burn-down1.png" alt="graph showing downward trend of burn-down" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both of those can be totally fine! For some systems, it’s perfectly reasonable to have a slow, gentle burn of occasional errors. For others, we want to keep our powder dry to compensate for more-severe outages!&lt;/p&gt;

&lt;p&gt;The graph from six days ago was looking dismal. That first incident had burned 40% of the budget on its own; combined with the usual pace of “a few percent a day,” the budget was nearly exhausted.&lt;/p&gt;

&lt;p&gt;But if we look at the burn down graph today, things are looking better! The first incident is off the books, and now we're only making up for the errors of D2. Someday, that too will be forgotten.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZS9zk5J9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO3-burn-down2.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5740" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZS9zk5J9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO3-burn-down2.png" alt="graph showing burn-down trend today" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should also take a look at how we compare to the goal. For every day, we can compute the percentage of events that have passed the SLI. As you can see, we’re above 95% for most 30-day periods. At the trough of the first incident, things were pretty bad — and we lost ground, again, with the second one — but now we’re maintaining a comfortably higher level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wz9qCxDY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO4-overall-budget.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5741" src="https://res.cloudinary.com/practicaldev/image/fetch/s--wz9qCxDY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO4-overall-budget.png" alt="graph showing error budget overall" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, all these illustrations have shown moments when our problems were comfortably in the past. While that’s a great place to have our problems, we wouldn’t be using Honeycomb if all our problems were solved. That’s why there are two other important SLO aspects to think about:&lt;/p&gt;

&lt;h2&gt;SLO Burn Alerts&lt;/h2&gt;

&lt;p&gt;When the error rate is gradually increasing, it would be great to know when we'll run out of budget. Honeycomb creates Burn Alerts to show when our SLO will run out of budget. The green line shows the gradually shrinking budget, but on a slightly adjusted window.&lt;/p&gt;

&lt;p&gt;Then, Honeycomb predicts forward. The orange line looks at how our &lt;b&gt;last&lt;/b&gt; hour has been, and then interpolates forward to the &lt;b&gt;next&lt;/b&gt; four hours. In this image, the four hour estimate is going to dip below zero — and so the system warns the user.&lt;/p&gt;
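&lt;p&gt;The prediction can be as simple as a linear extrapolation: measure how much budget the last hour burned, project that rate over the lookahead window, and warn if the projection crosses zero. A sketch of the idea, not the production algorithm:&lt;/p&gt;

```python
# Linear burn-alert check: if the last hour's burn rate, carried
# forward, would exhaust the budget within the lookahead, warn.
def burn_alert(budget_now, budget_hour_ago, lookahead_hours=4.0):
    burn_per_hour = budget_hour_ago - budget_now
    projected = budget_now - burn_per_hour * lookahead_hours
    return projected < 0

print(burn_alert(budget_now=90.0, budget_hour_ago=120.0))  # True
```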

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K6h0JpOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO5-prediction.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5742" src="https://res.cloudinary.com/practicaldev/image/fetch/s--K6h0JpOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO5-prediction.png" alt="graph showing extrapolation of errore budget exhaustion" width="1940" height="1108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can let us know how long until we use up our error budget. It acts as a forewarning against slow failures.&lt;/p&gt;

&lt;p&gt;It’s really useful to have a couple of different time ranges. A 24 hour alert can mean “you’ve got a slow degradation in your service; you might want to fix it, but worry about it in the morning.” A four hour alert means “it’s time to get cracking.” (At Honeycomb, we tend to send 24 hour alerts to Slack channels, but 4 hour alerts to PagerDuty.)&lt;/p&gt;

&lt;h2&gt;&lt;b&gt;Find out why it's going wrong&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;This wouldn’t be Honeycomb if we didn’t provide you tools to dive into an issue. The SLO Page shows a &lt;a href="https://docs.honeycomb.io/working-with-your-data/heatmaps/"&gt;Heatmap&lt;/a&gt; and a &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/"&gt;BubbleUp&lt;/a&gt; of the last 24 hours, so you can figure out what’s changed and how you want to take it on.&lt;/p&gt;

&lt;p&gt;Here’s a great example: the SLO page for a Honeycomb tool that’s looking at rendering speed. (Yep, we’ve even set an SLO on end user experience!) This is a pretty loose SLO — really, we’re keeping it around to alarm us if our pages suddenly get really bad — but we can see that we’re doing OK against our goals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iU445LCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO6-overall.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5743" src="https://res.cloudinary.com/practicaldev/image/fetch/s--iU445LCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO6-overall.png" alt="screenshot of overall SLO view page" width="1146" height="1306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bottom half of the page shows &lt;i&gt;where&lt;/i&gt; the problems are coming from. The BubbleUp heatmap shows the last day of events: events higher up are yellow, meaning they fail the SLI; events lower down are blue, meaning they comply with it. We can see that mostly this is happening when events are particularly slow.&lt;/p&gt;

&lt;p&gt;We can also look in there and see that it’s &lt;b&gt;one particular page&lt;/b&gt; that seems to be having the worst experience, and one particular user email that’s running slow. That’s a pretty cool insight — it tells us where to look, and how we might want to handle it. It also gives us a sense of what repro cases to look for, so we can figure out what strange thing this user is doing.&lt;/p&gt;

&lt;h2&gt;Now, define your own SLOs&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/"&gt;Honeycomb SLOs&lt;/a&gt; are now released and are available to Enterprise/yearly contract customers. We’d love to learn more about how you think about SLOs, and what you use them for.&lt;/p&gt;




&lt;p&gt;Read the final installment in this blog series: &lt;a href="https://dev.to/honeycombio/challenges-with-implementing-slos-1kp"&gt;Challenges with Implementing SLOs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New to Honeycomb? &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=honeycomb-slo-now-generally-available-success-defined"&gt;Get started for free&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>slo</category>
      <category>observability</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Working Toward Service Level Objectives (SLOs), Part 1</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Fri, 04 Dec 2020 23:19:16 +0000</pubDate>
      <link>https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed</link>
      <guid>https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed</guid>
      <description>&lt;p&gt;In theory, Honeycomb is always up. Our servers run without hiccups, our user interface loads rapidly and is highly responsive, and our query engine is lightning fast. In practice, this isn’t always perfectly the case — and dedicated readers of this blog have learned about how we &lt;a href="https://www.honeycomb.io/search/incident+review"&gt;use those experiences to improve the product&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, we could spend all our time on system stability. We could polish the front-end code and look for inefficiencies; throw ever-harder test-cases at the back-end. (There are a few developers who are vocal advocates for that approach!) But we also want to make the product better — and so we keep rolling out so many &lt;a href="https://changelog.honeycomb.io/"&gt;great features&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;How do we decide when to work on improving stability, and when we get to go make fun little tweaks?&lt;/p&gt;

&lt;h2&gt;From organizational objectives to events&lt;/h2&gt;

&lt;p&gt;So let’s say we come to an agreement with each other about "how bad" things are. When things are bad—when we’re feeling mired in small errors, or one big error that takes down the service—we can slow down on Cool Feature Development, and switch over to stability work. Conversely, when things are feeling reasonably stable, we can believe we have a pretty solid infrastructure for development, and can slow down on repair and maintenance work.&lt;/p&gt;

&lt;p&gt;What would that agreement look like?&lt;/p&gt;

&lt;p&gt;First, it means being able to take a good hard look at our system. Honeycomb has a &lt;a href="https://www.honeycomb.io/blog/toward-a-maturity-model-for-observability/"&gt;mature level of observability&lt;/a&gt;, so we feel pretty confident that we have the raw tools to look at how we’re doing — where users are experiencing challenges, and where bugs are appearing in our system.&lt;/p&gt;

&lt;p&gt;Second, it means coming to understand that no system is perfect. If our goal is 100% uptime at all times for all requests, then we’ll be disappointed, because some things will fail from time to time.&lt;em&gt; But we can come up with statements about quality of service&lt;/em&gt;. Honeycomb had an internal meeting where we worked to quantify this:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;We pretty much &lt;b&gt;never&lt;/b&gt; want to lose customer data. We live and die by storing customer data, so we want &lt;i&gt;every&lt;/i&gt; batch of customer telemetry to get back a positive response, and quickly. Let’s say that we want them to be handled in under 100 ms, without errors, for 99.95% of requests. (That means that in a full year, we could have about &lt;b&gt;4.4 hours&lt;/b&gt; of downtime.)&lt;/li&gt;
    &lt;li&gt;We want our main service to be up pretty much every time someone clicks on honeycomb.io, and we want it to load pretty quickly. Let’s say we want to load the page without errors, within a second, for 99.9% of requests.&lt;/li&gt;
    &lt;li&gt;Sometimes, when you run a query, it takes a little longer. For that, we decided that 99.5% of data queries should come back within 10 seconds and not return an error.&lt;/li&gt;
&lt;/ul&gt;
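&lt;p&gt;The downtime figures fall straight out of the percentages: a year has 365 × 24 = 8,760 hours, so an allowance of 0.05% works out to a bit under four and a half hours:&lt;/p&gt;

```python
# Convert an availability target into allowed downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours_per_year(target):
    return (1.0 - target) * HOURS_PER_YEAR

print(round(downtime_hours_per_year(0.9995), 2))  # about 4.38 hours
```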

&lt;p&gt;These are entirely reasonable goals. The wonderful thing is, they can actually be expressed in Honeycomb’s dogfood servers as &lt;a href="https://www.honeycomb.io/blog/level-up-with-derived-columns-two-neat-tricks-that-will-improve-your-observability/"&gt;Derived Column expressions&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;For example, we would write that first one — the one about data ingest — as “We’re talking about events where &lt;code&gt;request.endpoint&lt;/code&gt; uses the &lt;code&gt;batch&lt;/code&gt; endpoint and the input is a &lt;code&gt;POST&lt;/code&gt; request. When they do, they should return a code 200, and the &lt;code&gt;duration_ms&lt;/code&gt; should be under 100.”&lt;/p&gt;

&lt;p&gt;Let’s call this a “Service Level Indicator,” because we use it to indicate how our service is doing. In our &lt;a href="https://docs.honeycomb.io/working-with-your-data/customizing-your-query/derived-columns/reference/"&gt;derived column language&lt;/a&gt;, that looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"batch"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;LT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We name that derived column “SLI”, and we can generate a COUNT and a HEATMAP on it.&lt;/p&gt;

&lt;p&gt;This looks pretty sane: we see that there are many more points that are true (the indicator is ok! everything is great!) than false (oh, no, they failed!); and we can see that all the points that are slowest are in the “false” group.&lt;/p&gt;

&lt;p&gt;Let’s whip out our Trusty Pocket Calculator: 35K events with “false”; 171 million with true. That’s about a 0.02% failure rate — we’re up at 99.98%. Sounds like we’re doing ok!&lt;/p&gt;
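&lt;p&gt;(No pocket calculator required, of course; the same arithmetic is a couple of lines of code:)&lt;/p&gt;

```python
# Failure rate from the COUNT results: "false" events failed the SLI,
# "true" events passed it.
failed, passed = 35_000, 171_000_000
failure_rate = failed / (failed + passed)
print(f"{failure_rate:.2%} failing, {1 - failure_rate:.2%} passing")
# 0.02% failing, 99.98% passing
```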

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ToJjW0OX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-1.gif" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5232" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ToJjW0OX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-1.gif" alt="animated gif of a graph with some failures" width="1112" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But there are still some failures. I’d love to know why!&lt;/p&gt;

&lt;p&gt;By clicking over to the BubbleUp tab, I can find out who is having this slow experience. I highlight all the slowest requests, and BubbleUp shows me a histogram for every dimension in the dataset. By finding the columns that differ most from everything else, I can see where these errors stand out.&lt;/p&gt;

&lt;p&gt;... and I see that it’s one particular customer, and one particular team. Not only that, but they’re using a fairly unusual API for Honeycomb (that’s the fourth entry, &lt;code&gt;request.header.user-agent&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9GsO5xFS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-2.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5233" src="https://res.cloudinary.com/practicaldev/image/fetch/s--9GsO5xFS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-2.png" alt="screenshot of using BubbleUp to identify one problematic user" width="985" height="988"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is great, and highly actionable! I can reach out to the customer, and find out what’s up; I can send our integrations team to go look at that particular package, and see if we’re doing something that’s making it hard to use well.&lt;/p&gt;

&lt;h2&gt;Quantifying quality of service means you can measure it&lt;/h2&gt;

&lt;p&gt;So, bringing that back to where we started: we began with organizational goals and found a way to quantify an abstract concept. “Always up and fast” now has a meaning, and is measurable. We can then use that to diagnose what’s going wrong, and figure out how to make it faster.&lt;/p&gt;

&lt;p&gt;Part 2, coming soon: Wait, why did I have to pull out my pocket calculator? Don’t we have computers for that? Also, this term “SLI”, it feels familiar somehow...&lt;/p&gt;




&lt;p&gt;Read the next post in the series:&lt;br&gt;
&lt;a href="https://www.honeycomb.io/blog/honeycomb-slo-now-generally-available-success-defined/"&gt;Honeycomb SLO Now Generally Available: Success, Defined.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Excited about what the future of operational excellence looks like? Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=working-toward-service-level-objectives-slos-part-1"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>slos</category>
      <category>observability</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>RubyGems surge during Cloudflare outage: an investigation</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Sat, 18 Jul 2020 00:27:04 +0000</pubDate>
      <link>https://dev.to/honeycombio/rubygems-surge-during-cloudflare-outage-an-investigation-2i31</link>
      <guid>https://dev.to/honeycombio/rubygems-surge-during-cloudflare-outage-an-investigation-2i31</guid>
      <description>&lt;p&gt;Today at Honeycomb we saw a huge drop in incoming data volume from several customers. It turned out to be a result of the &lt;a href="https://www.cloudflarestatus.com/incidents/b888fyhbygb8"&gt;Cloudflare outage&lt;/a&gt;.&lt;/p&gt;


&lt;blockquote class="ltag__twitter-tweet"&gt;
      &lt;div class="ltag__twitter-tweet__media"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XcrOb1a5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKRIK_XgAMHuEd.png" alt="unknown tweet media content"&gt;
      &lt;/div&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--VhzL9fCz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/910366463663407105/o6zFbfsP_normal.jpg" alt="honeycomb profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        honeycomb
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        @honeycombio
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P4t6ys1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      Here's what the downstream effects of a &lt;a href="https://twitter.com/Cloudflare"&gt;@Cloudflare&lt;/a&gt; outage looked like to our customer success team: some of our customers processed less traffic (and thus sent us less telemetry), while one had a traffic surge!&lt;br&gt;&lt;br&gt;Our measurements show the impact lasted 25 minutes and is now over. 
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      22:09 - 17 Jul 2020
    &lt;/div&gt;


    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1284248833233489920" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-reply-action.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1284248833233489920" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-retweet-action.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      15
      &lt;a href="https://twitter.com/intent/like?tweet_id=1284248833233489920" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-like-action.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
      39
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;


&lt;p&gt;I did this investigation as a &lt;a href="https://twitter.com/FisherDanyel/status/1284252031624949760"&gt;live thread in the replies&lt;/a&gt; to the above Honeycomb tweet, but I'm copying it here as well. Follow along with us!&lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__434528"&gt;
  
  
    &lt;a href="/danyelf" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WdoYpAhT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://res.cloudinary.com/practicaldev/image/fetch/s--2Qlfin4u--/c_fill%2Cf_auto%2Cfl_progressive%2Ch_150%2Cq_auto%2Cw_150/https://dev-to-uploads.s3.amazonaws.com/uploads/user/profile_image/434528/ebaf5f43-6e8d-4dd7-941a-9323a74fbd18.png" alt="danyelf image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/danyelf"&gt;Danyel Fisher&lt;/a&gt;
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/danyelf"&gt;/danyelf&lt;/a&gt;
    &lt;/div&gt;
    &lt;p class="ltag__user__social"&gt;
        &lt;a href="https://github.com/danyelf" rel="noopener"&gt;
          &lt;img class="icon-img" alt="github logo" src="https://res.cloudinary.com/practicaldev/image/fetch/s--C74Jn3f1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/github-logo.svg"&gt;danyelf
        &lt;/a&gt;
    &lt;/p&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Do you notice that one spike that went &lt;em&gt;upwards&lt;/em&gt;? Someone had increased traffic during the outage.&lt;/p&gt;

&lt;p&gt;By astounding coincidence, it just happens to be &lt;a href="https://rubygems.org/"&gt;RubyGems&lt;/a&gt;, which Honeycomb -- working with &lt;a href="https://twitter.com/rubytogether"&gt;@rubytogether&lt;/a&gt; -- offers as a &lt;em&gt;free public dataset&lt;/em&gt; for anyone to play with.&lt;/p&gt;

&lt;p&gt;Which means that you can, without signing up, &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/byDD5JdUuew"&gt;go explore live data from RubyGems&lt;/a&gt;! And since Honeycomb permalinks work forever, all the links here will open query results in the Honeycomb UI even after the original outage has aged out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GGLSvFZU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKVE3iUwAALOaw%3Fformat%3Dpng%26name%3D900x900" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GGLSvFZU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKVE3iUwAALOaw%3Fformat%3Dpng%26name%3D900x900" alt="line graph showing the number of events over time, with an increase at 14:15 that continues for about 20 minutes before going back to normal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're looking at the &lt;a href="https://www.fastly.com/"&gt;Fastly&lt;/a&gt; logs from the RubyGems site. Each event represents a request about a gem -- maybe listing available gems, or downloading one.&lt;/p&gt;
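&lt;p&gt;To make the raw material concrete: each event is just a bag of key/value fields. Here's a sketch of one such event as a Ruby hash -- only &lt;code&gt;response_body_size&lt;/code&gt; and &lt;code&gt;request_user_agent&lt;/code&gt; are field names taken from this dataset; the rest are illustrative.&lt;/p&gt;

```ruby
# A sketch of one Fastly request event as a Ruby hash. Only
# response_body_size and request_user_agent appear in this post;
# the other field names and all the values are illustrative.
event = {
  "timestamp"           => "2020-07-17T14:17:03Z",
  "gem"                 => "wkhtmltopdf-binary",  # invented value
  "response_body_size"  => 15_000_000,            # bytes
  "request_user_agent"  => "Ruby, RubyGems/2.6.14 x86_64-linux",
  "status"              => 200
}

puts event["response_body_size"]
```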

&lt;h2&gt;Exploring by Size&lt;/h2&gt;

&lt;p&gt;Let's add in a &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/6ewxqcgWrEx"&gt;heatmap&lt;/a&gt; of the &lt;code&gt;response_body_size&lt;/code&gt; -- in other words, how much were people downloading?  Looks like during that time, we saw &lt;em&gt;fewer&lt;/em&gt; requests for big downloads.&lt;/p&gt;
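&lt;p&gt;If you'd rather build the same query programmatically, Honeycomb also has a Query API that takes a JSON spec. A hedged sketch of the heatmap query as a Ruby hash -- the shape follows Honeycomb's Query Data API, but double-check the docs before relying on it:&lt;/p&gt;

```ruby
require "json"

# A sketch of the heatmap query as a Honeycomb-style query spec.
# Treat this as an illustration, not a verified request body.
query = {
  "calculations" => [
    { "op" => "COUNT" },
    { "op" => "HEATMAP", "column" => "response_body_size" }
  ],
  "time_range" => 7200  # seconds of history to query
}

puts JSON.generate(query)
```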

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ACynQccD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKV2gOUwAAVMvV%3Fformat%3Dpng%26name%3D900x900" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ACynQccD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKV2gOUwAAVMvV%3Fformat%3Dpng%26name%3D900x900" alt="A chart showing the count of downloads, as well as a heatmap of response_body_size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's grab a few samples showing how traffic during the outage differs from before and after it. I click to "BubbleUp" -- again, you can do this too, with the same clicks I did! -- and select that top area. This asks "how are the data points in the yellow area different from the green ones below?"&lt;/p&gt;
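&lt;p&gt;Conceptually, BubbleUp's comparison is simple: for each field, compare how often each value appears in the selected events versus the baseline, and surface the values that are over-represented. A toy sketch with invented data:&lt;/p&gt;

```ruby
# Toy version of BubbleUp's comparison for one field: measure how much
# more common each value is in the selection than in the baseline.
def value_shares(events)
  total = events.size.to_f
  events.tally.transform_values { |n| n / total }
end

# Invented user-agent values for illustration.
baseline  = ["chrome", "chrome", "ruby-agent", "curl"]
selection = ["ruby-agent", "ruby-agent", "ruby-agent", "chrome"]

base_shares = value_shares(baseline)
sel_shares  = value_shares(selection)

# Positive difference means the value is over-represented in the selection.
diffs = sel_shares.map { |v, share| [v, share - base_shares.fetch(v, 0.0)] }.to_h

puts diffs.max_by { |_, d| d }.first  # the most over-represented value
```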

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XDx4wQ33--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKWhfLU0AMr5Xh%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XDx4wQ33--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKWhfLU0AMr5Xh%3Fformat%3Dpng%26name%3Dmedium" alt="Using BubbleUp on the Heatmap view, selecting the events with the highest response body size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we see the top few dimensions where the requests highlighted in yellow look different. That huge file was &lt;a href="https://wkhtmltopdf.org/"&gt;wkhtmltopdf&lt;/a&gt;. I'd love to know why it suddenly fell off the radar. Is there something about big packages that made them more likely to fail?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tXZomov0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKWzqIU0AAhK62%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tXZomov0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKWzqIU0AAhK62%3Fformat%3Dpng%26name%3Dmedium" alt="The gem name for the largest responses is wk html to pdf"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You know what? &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/vPgytdjWvSs"&gt;Let's check that hypothesis.&lt;/a&gt; Let's list all gems by their size (descending). Now we can scroll through some of the various gems and see if they also had interruptions.&lt;/p&gt;
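&lt;p&gt;That "list by size" step boils down to a group-and-aggregate. A toy sketch of the same operation in plain Ruby, with invented events:&lt;/p&gt;

```ruby
# Sketch of "list all gems by their size (descending)": group events by
# gem name, take the largest response seen for each, sort descending.
# Gem names and sizes here are invented.
events = [
  { gem: "wkhtmltopdf-binary", size: 15_000_000 },
  { gem: "rails",              size: 350_000 },
  { gem: "wkhtmltopdf-binary", size: 14_800_000 },
  { gem: "rake",               size: 80_000 }
]

by_size = events.group_by { |e| e[:gem] }
                .transform_values { |es| es.map { |e| e[:size] }.max }
                .sort_by { |_, size| -size }

by_size.each { |name, size| puts "#{name}: #{size}" }
```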

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr0Hx65O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://p-81fa8j.b1.n0.cdn.getcloudapp.com/items/BluO87dy/2020-07-17_15-41-07%2520%25281%2529.gif%3Fv%3D6877c6385bdafa71bbb58441a64e8497" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dr0Hx65O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://p-81fa8j.b1.n0.cdn.getcloudapp.com/items/BluO87dy/2020-07-17_15-41-07%2520%25281%2529.gif%3Fv%3D6877c6385bdafa71bbb58441a64e8497" alt="gif: scrolling over each of the gem names to see the shape of their graphs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'll leave it to someone smarter than myself to explain why &lt;code&gt;wkhtmltopdf-binary&lt;/code&gt; had a dropoff, but &lt;code&gt;wkhtmltopdf-binary-edge&lt;/code&gt; saw consistent traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DGuW95Se--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKZvCdUwAAEoyd%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DGuW95Se--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKZvCdUwAAEoyd%3Fformat%3Dpng%26name%3Dmedium" alt="graph showing the count of various gem names, with wk html to pdf binary and wk html to binary edge visible"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Feel free to try this later!&lt;/h2&gt;

&lt;p&gt;As a side note: every one of these URLs will persist indefinitely. Honeycomb keeps query results forever, so it's always easy to share a link.&lt;/p&gt;

&lt;p&gt;If you find something, a permalink is definitely the best way to share it back! (On Slack, they even get pretty unfurls.)&lt;/p&gt;

&lt;h2&gt;Examining what went up&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/htzomsGaS9b"&gt;Back to the main heatmap.&lt;/a&gt; When I've got some outliers, a log transformation can help make the image much easier to read. So here we've got the response body size -- and the log scaled version.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FCH3sRq7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKbO_bUMAECGDl%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FCH3sRq7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKbO_bUMAECGDl%3Fformat%3Dpng%26name%3Dmedium" alt="Graphs of the event count as well as the response body size, with and without using a log scale"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That dark blue spot really stands out to me. Why are the requests in that dense area so different from everything else? &lt;/p&gt;

&lt;p&gt;Let's try another BubbleUp: we'll grab the interesting area, select it, and BubbleUp shows how the fields in that selection differ from the baseline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MtC0Vd7Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKbjQvU0AApVee%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MtC0Vd7Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKbjQvU0AApVee%3Fformat%3Dpng%26name%3Dmedium" alt="Heatmap view with the graph's dark spot highlighted and some BubbleUp results visible"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most fields aren't very different. But then there's that one on the left: &lt;code&gt;request_user_agent&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A quick hover tells us that's &lt;code&gt;Ruby, RubyGems/2.6.14 x86_64-linux Ruby/2.4.10 (2020-03-31 patchlevel 364)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CsPPCxhR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKcCuAUwAAK0d1%3Fformat%3Dpng%26name%3Dsmall" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CsPPCxhR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKcCuAUwAAK0d1%3Fformat%3Dpng%26name%3Dsmall" alt="Hovering over the outlier request user agent to reveal the value"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Um. WTF? My first hypothesis would be that it's a coincidence: maybe that version had just been released, so these events are disproportionately from it. (Correlation does not equal causation!)&lt;/p&gt;

&lt;p&gt;So let's break down by user agent. There are a lot of them! They have different temporal patterns -- including a few that may share a pattern similar to the one we care about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O75boR5Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKdPm5UEAAHgJA%3Fformat%3Dpng%26name%3Dlarge" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O75boR5Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKdPm5UEAAHgJA%3Fformat%3Dpng%26name%3Dlarge" alt="Query results grouped by user agent"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But still: this one seems special. When Cloudflare went down, it decided to spike like mad. &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/taMVHNF2LzN"&gt;Here it is in Honeycomb.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K96-t5AE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKeJ86UwAALOs2%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K96-t5AE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKeJ86UwAALOs2%3Fformat%3Dpng%26name%3Dmedium" alt="Query results for the above outlier user agent, with a large spike in the event count graph during the Cloudflare outage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Was it just this one crazy user? Good thing we can &lt;a href="https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/bVCGA8YgLFJ"&gt;break down by user IP address&lt;/a&gt; (hashed and sanitized for your protection)! Nope.&lt;/p&gt;
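&lt;p&gt;For the curious: sanitizing IPs for a public dataset can be as simple as a salted one-way hash, so the same client still groups together without the address being readable. A sketch -- the actual RubyGems pipeline may well differ:&lt;/p&gt;

```ruby
require "digest"

# One way to sanitize client IPs for a public dataset: hash them so the
# same client still groups together, but the address is not recoverable.
# The salt prevents trivially re-hashing the whole IPv4 space.
SALT = "example-salt"  # illustrative; a real pipeline would keep this secret

def sanitize_ip(ip)
  Digest::SHA256.hexdigest(SALT + ip)[0, 12]
end

puts sanitize_ip("203.0.113.7")
# The same input always maps to the same token; different inputs diverge.
```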

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QQzT-6qe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKeR2KUEAE9qJY%3Fformat%3Dpng%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QQzT-6qe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/EdKeR2KUEAE9qJY%3Fformat%3Dpng%26name%3Dmedium" alt="Query results grouped by sanitized IP addresses"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What's next?&lt;/h2&gt;

&lt;p&gt;There are a lot of places we could go with this investigation. We could investigate what people downloaded, or what other user agents seem to have been affected, or whether caching policies made a difference.&lt;/p&gt;

&lt;p&gt;We could look at the time it took to serve requests.&lt;/p&gt;

&lt;p&gt;But what's more fun than me looking at this stuff? It's you.&lt;/p&gt;

&lt;p&gt;Go click the links. Play with things a little bit. See if you can find something, and let me know what you found.&lt;/p&gt;

&lt;p&gt;Let's learn, together, from the public RubyGems data!&lt;/p&gt;

</description>
      <category>o11y</category>
      <category>observability</category>
      <category>honeycomb</category>
    </item>
  </channel>
</rss>
