Originally published on Failure is Inevitable.
This is the second article of a two-part series. Click here for part 1 of the interview with Brian, Carrie, JP, and Zac to learn more about Twitter’s SRE journey.
Previously, we saw how SRE at Twitter has transformed their engineering practice to drive production readiness at scale. The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-driven decisions around reliability. (Read here for a definition of SLOs and how they transformed Evernote.) Today, the Twitter team has invested in centralized tooling to measure, track, and visualize SLOs and their corresponding error budgets.
However, successfully implementing SLOs is far easier said than done. Many organizations have struggled with adoption for a number of reasons. Common obstacles include getting stakeholder buy-in, not knowing what (and how) to measure, and confusion over how to make SLOs actionable.
While the Twitter engineering team had laid a very strong foundation around observability and reliability, it took several important breakthroughs before SLOs began achieving broader adoption within the organization, and the journey continues.
The foundations for SLOs
Prior to SLOs, the engineers had used service level indicators (SLIs) for many years. These SLIs drew on Twitter’s extensive infrastructure instrumentation and investment in their observability stack, which provided a foundation for measuring service health with indicators such as success rate, latency, and throughput across their distributed service ecosystem. For example, the team would monitor the success rate for user-facing HTTP services, computed by comparing HTTP 500 errors against total requests.
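As a rough illustration (not Twitter’s actual tooling), a success-rate SLI of this kind can be computed directly from request counters; the counter names and values here are hypothetical:

```python
def success_rate_sli(total_requests: int, http_500s: int) -> float:
    """Success-rate SLI: fraction of requests that did not fail with HTTP 500.

    `total_requests` and `http_500s` are assumed to come from the service's
    request counters over the same measurement window.
    """
    if total_requests == 0:
        return 1.0  # no traffic in the window; treat as fully successful
    return (total_requests - http_500s) / total_requests

# Example: 1,000,000 requests with 1,200 HTTP 500s -> 0.9988 (99.88%)
print(success_rate_sli(1_000_000, 1_200))
```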
Integrating the SLIs with alerts and on-call rotations had been a core practice within their engineering teams for years. Additionally, their focus on incident management and postmortems has enabled them to learn continuously from their ever-evolving production ecosystem.
A significant inflection point came with embedding the concept of SLO within Finagle, Twitter’s RPC library, which is maintained by the Core Systems Libraries (CSL) team. As mentioned in the previous post, Finagle delivers reliability features such as load balancing, circuit breakers, failure detectors, and more, building them into every single piece of software that runs. In 2018, the CSL team made SLOs a first-class entity in Twitter’s internal version of Finagle, creating a foundational API building block tied to a service boundary, which they call an objective. This was transformative in that it allowed the team to begin defining and modeling service-to-service interactions beyond just an alert, creating a programmatic definition the team could now use to inform runtime decisions.
The Twitter team supported the implementation with proposals for projects and use cases that could use the SLO feature, and initially delivered the configuration as well as real-time per-instance measurement of SLOs.
In its initial phases, adoption of the feature was limited. Service owners could configure SLOs, but with little tooling and no benefits automatically tied to turning SLOs on, there was little incentive to do so in the context of other priorities.
Seeing this, the team invested in follow-up work. They began to build integrations and solutions for service owners on top of SLOs, such as load shedding based on SLOs, since SLOs provide more useful context than an indirect metric like CPU throttling. Through piloting such enhancements, the appetite for adoption began to increase.
Defining SLOs
In thinking about how to define SLOs, the Twitter team typically begins by considering which features are key, and ensuring that they're well instrumented and understood.
It’s important to identify the signals that best reflect a critical user experience. Some signals, such as service success rate, can provide color but are not straightforward to interpret. For example, when analyzing the error rate measured inside the data center, clients might retry failed requests, which makes the internal error rate a misleading datapoint for reasoning about the true user-facing success rate.
Once the team sets a reasonable SLO at the top level, that objective drives down through the services that the boundary depends on. Every service has a multitude of service dependencies, so the latency and success SLOs for all upstream and downstream services must work together in the context of the defined boundary. SLOs enable a more holistic way of measuring the whole call path.
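To make this concrete, here is a hypothetical sketch (not from the interview) of how a top-level success SLO constrains the services below it: when a request must traverse several dependencies in series, the best achievable end-to-end success rate is roughly the product of the dependency success rates.

```python
from math import prod

# Hypothetical per-dependency success-rate SLOs along one call path.
dependency_slos = [0.9995, 0.999, 0.9995, 0.999]

# If every dependency just meets its SLO, the best end-to-end success rate
# a serial call path can achieve is approximately the product of the SLOs.
end_to_end = prod(dependency_slos)
print(f"end-to-end success rate: {end_to_end:.4%}")  # ~99.70%

# So a 99.9% top-level SLO cannot be met if four serial dependencies each
# only promise 99.9% (0.999**4 ~ 99.6%); the dependencies must be held to
# tighter targets for the boundary's SLO to hold.
```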
A major turning point: tying SLOs to error budgets
The introduction of error budgets marked another critical inflection point in Twitter’s adoption of SLOs. Error budgets make SLOs actionable and provide a different lens to understand a service over time, so they were an important follow-on feature after the original delivery of SLOs.
Error budgets look at the SLO over time, and have allowed the team to begin tracking performance by providing a historical view into how the service met its objectives across different timeframes. The traditional metric view tends to be shortsighted and can bury signals about valuable trends and opportunities. Instead of a dashboard that charts hundreds of metrics, error budgets become a forcing function to pick a few of the most important metrics and dig deeper into how and why they change over time.
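A minimal sketch of the arithmetic (the window size, target, and counts below are hypothetical): an error budget is simply the allowed failure fraction implied by the SLO, tracked over a rolling window.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent over the window.

    slo_target: e.g. 0.999 for a 99.9% success-rate SLO.
    Returns 1.0 when no budget has been spent, 0.0 or less when exhausted.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests else 1.0
    return 1.0 - (failed_requests / allowed_failures)

# Example: a 99.9% SLO over a 30-day window with 50M requests allows
# 50,000 failures; 20,000 observed failures leaves 60% of the budget.
print(error_budget_remaining(0.999, 50_000_000, 20_000))
```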
An important note is that the team does not prescribe a fixed set of actions upon the exhaustion of the error budget. While error budgets can be a powerful tool, the true value has to resonate with engineering and product teams.
With the notion of “context, not control” (coined by Netflix), there is a strong emphasis on empowering well-intentioned, capable teammates with visualizations and insights that allow them to make better decisions. In the same way, Twitter SREs apply ongoing experimentation to understand what other team members will view as valuable to measure. They understand that error budgets are about giving team members good tools and context; there is no one-size-fits-all policy.
For example, one team hypothesized that the error budget would help inform when automated deploys could proceed, and specifically, whether to pause a deploy if the error budget was exhausted. But what they found was that the paused or blocked deploy could sometimes contain the fix for the increased errors. Thus, the simple rule of “block deploys if no error budget remains” quickly began to fall apart. The very deploy being blocked could decrease the volume or rate of errors, and possibly even defend the SLO and enable it to be met over the remainder of its window.
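For illustration only, the naive gate described above can be written in a couple of lines, which is part of why its failure mode was easy to miss:

```python
def naive_deploy_gate(error_budget_remaining: float) -> bool:
    """The simple rule that 'quickly began to fall apart':
    allow automated deploys only while error budget remains."""
    return error_budget_remaining > 0.0

# The catch: when the budget is exhausted because of a bug already in
# production, this gate also blocks the deploy that would fix the bug.
print(naive_deploy_gate(0.0))   # False -> deploy blocked, fix included or not
print(naive_deploy_gate(0.25))  # True  -> deploy proceeds
```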
Bearing in mind that they aren’t necessarily meant to be prescriptive, error budgets provide very useful suggestions for service owners in thinking about how to prioritize work. They create an important ‘runway’ for scaling the pace of innovation up or down. For example, overly rapid error budget burndown could be a sign to prioritize mitigation work for the current on-call or an upcoming sprint. Alternatively, not using enough of the error budget could nudge the team to iterate on feature work faster, or experiment more.
The benefits of SLOs
While the team is still early in its adoption of SLOs, they’ve already seen the immense potential and value of SLOs in several ways.
From a ‘distributed service zoo’ to a shared language
Twitter has hundreds, if not thousands, of services, making its infrastructure a complex beast to understand. The current members of the Twitter Command Center (TCC) have been around long enough that they generally know what most of the services are and how they ‘snap together’. However, they know that eventually they will reach a point where that becomes impossible, where no individual can grok how it all works. By investing in SLOs now to help guide discussions, the goal is that by the time they reach that point of un-knowable complexity, they will have set themselves up to manage service metrics programmatically.
The right amount of context
Context is key. Dashboards can easily have hundreds of charts, which translate into thousands of metrics. Teams might have tens or hundreds of alerts on their services across multiple data centers. These dashboards, metrics, and alerts are helpful for those running the services, but they demand very high context and amount to information overload for anyone else.
SLOs create the ability to have more directed conversations with shared context. Instead of looking at a hundred charts on a dashboard, the team can align on the four or five things that matter. If any of those are not green, others can understand that something’s not right without having to know anything else about the service.
Dynamic load balancing and load shedding
By making SLOs a first-class entity, services can speak about them at the programming level, beyond just measuring them. This enables the team to make systematic improvements using SLOs as a building block. For example, the team is exploring whether back pressure in Finagle can instead be SLO-based.
With Finagle, services can programmatically detect when they are under load (typically via second-class signals such as CPU), and then signal to redirect traffic to another instance. Instead of relying on second-class signals to implement back pressure, a service can directly know if it’s trending towards an SLO violation in order to signal back pressure and reduce load on itself.
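The sketch below is not Finagle code; it only illustrates the idea of shedding load based on an SLO trend rather than a second-class signal like CPU. The window size, shedding curve, and thresholds are all invented for the example.

```python
import random

class SloAwareShedder:
    """Illustrative load shedder: starts rejecting a fraction of traffic
    when the recent success rate trends below the SLO target.

    This is a toy model, not Finagle's implementation.
    """

    def __init__(self, slo_target: float = 0.999, window: int = 1000):
        self.slo_target = slo_target
        self.window = window
        self.outcomes = []  # True = success, False = failure

    def record(self, success: bool) -> None:
        self.outcomes.append(success)
        if len(self.outcomes) > self.window:
            self.outcomes.pop(0)

    def shed_probability(self) -> float:
        if not self.outcomes:
            return 0.0
        success_rate = sum(self.outcomes) / len(self.outcomes)
        if success_rate >= self.slo_target:
            return 0.0
        # Shed proportionally to how far below the SLO we are, capped at 50%.
        deficit = self.slo_target - success_rate
        return min(0.5, deficit * 100)

    def should_shed(self) -> bool:
        return random.random() < self.shed_probability()

# Usage sketch: simulate ~99% success against a 99.9% target.
shedder = SloAwareShedder()
for _ in range(1000):
    shedder.record(random.random() < 0.99)
print(shedder.shed_probability())
```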
Graceful degradation
One of the Twitter team’s goals for SLOs is gracefully degrading services during large-scale events to ensure that core functionality is always available. Rather than an all-or-nothing failure mode, the team aims to strip away peripheral features while maintaining core functionality.
The Twitter team is interested in using SLOs to implement a selective circuit breaker pattern to improve overall system reliability. Service owners can decide which upstream services are necessary for core functionality, and which are only needed for add-ons or bells and whistles. An upstream service that is not very important to one service could be critical to others. A consuming service can implement a circuit breaker to detect and stop sending traffic to services experiencing high error rates.
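A minimal sketch of that pattern, assuming an invented threshold and a hypothetical fallback; it is not Twitter’s implementation, only an illustration of skipping a non-core dependency when its error rate is high:

```python
class SelectiveCircuitBreaker:
    """Toy circuit breaker: a consuming service stops calling a non-core
    dependency once its recent error rate crosses a threshold, degrading
    gracefully with a fallback instead of failing the whole request.
    """

    def __init__(self, error_threshold: float = 0.5, min_samples: int = 20):
        self.error_threshold = error_threshold
        self.min_samples = min_samples
        self.calls = 0
        self.errors = 0

    def record(self, success: bool) -> None:
        self.calls += 1
        if not success:
            self.errors += 1

    def is_open(self) -> bool:
        """Open circuit = stop sending traffic to the struggling dependency."""
        if self.calls < self.min_samples:
            return False
        return (self.errors / self.calls) >= self.error_threshold

def fetch_with_fallback(breaker: SelectiveCircuitBreaker, call, fallback):
    """Skip the peripheral feature when its circuit is open."""
    if breaker.is_open():
        return fallback
    try:
        result = call()
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        return fallback
```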
SLOs as part of blameless culture
At Twitter, accountability is oriented around teammates holding themselves accountable to what they plan to do moving forward. As one of the most important aspects of effective operations is a blameless culture, SLOs are never tied to individual performance metrics. The goal is to define and implement more SLOs to gain greater visibility and understanding, rather than to blame teams for not meeting them.