One of the tools we use internally at Squadcast for SLO and Error Budget tracking is now open-source. In keeping with the SRE ideology of automating as many ops tasks as possible, we built this SLO Tracker. We made it open-source so that the SRE community can use it too. We look forward to your feedback, suggestions and patches :)
The tenet of a strong SRE culture lies in responsibly managing Error Budgets. However, you can only calculate error budgets after establishing the expected service SLOs in agreement with all the relevant stakeholders.
After defining organization-wide SLOs and the corresponding SLIs (to track SLAs), calculating Error Budgets is just a numbers game. In short, these metrics are the foundation of a strong SRE culture, and I cannot stress enough how much they promote accountability, trust and timely innovation.
Let’s understand this with an example. If your Service Level Indicator (SLI) is “xyz is true”, then the Service Level Objective (SLO), which is an organization-level objective, will read “xyz is true for a % of time”, and the corresponding Service Level Agreement (SLA), a legal contract meant for external/end users, will say “if xyz is not true for a % of time, then so and so will be compensated”.
Typically, Error Budgets allow you to track downtime in real time with a burn rate. The Error Budget is calculated as “1 - SLO”. So, an SLO of 99.99% over a year means the service can be down for no more than 52.56 minutes in that year (0.01% of the 525,600 minutes in a year).
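The arithmetic behind that figure can be sketched in a few lines. This is only an illustration of the “1 - SLO” formula; the function name `error_budget_minutes` and its parameters are made up for this example and are not part of SLO Tracker.

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    Hypothetical helper for illustration: budget = (1 - SLO) * window length.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


# A 99.99% SLO over a 365-day year leaves roughly 52.56 minutes of budget.
print(round(error_budget_minutes(99.99, 365), 2))
# The same SLO over a 30-day window leaves only a few minutes.
print(round(error_budget_minutes(99.99, 30), 2))
```

The shorter the window, the smaller the budget, which is why the tracking period should be agreed on together with the SLO itself.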
The development team can spend their Error Budget however they feel is right - either by preventing or by fixing system instabilities.
Ensuring service uptime is just one of the SRE objectives to ensure user gratification. A few other basic indicators concerning end user requirements could be:
- App load time should be less than 3 seconds,
- Load times for every feature in the app should be less than 3 seconds,
- Fewer buggy features rolled out - not more than 2 bugs reported by users in a span of 20 days,
- ‘Update time’ for data inputs to reflect should be less than 4 seconds,
- Retrieval of data within the app should take less than 2 seconds, to name a few.
Put simply, there needs to be a balance between what is acceptable to the end user and what is actually deliverable, keeping in mind the effort and budget needed. Understanding where end users can compromise on the experience is key. Based on that, identifying the right target thresholds for the chosen indicators becomes easier.
The ideal way to start off is by doing just enough to reduce the number of complaints raised by end users about a particular feature in the app. For example, when a user is trying to retrieve a huge data set from the app, they would be ready to accept a slight delay. In such a case, promising a 99% SLO for this indicator is both unnecessary and unrealistic. A more sensible target would be an SLO of around 85%. If users continue complaining even after this threshold is met, the indicators and objectives, along with their thresholds, can be revisited.
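As a rough sketch of how such a target can be checked, the snippet below measures the share of data-retrieval requests that finish within the 2-second threshold mentioned earlier and compares it against a 85% and a 99% target. The latency samples and the function names are invented for illustration.

```python
def sli_compliance(latencies_s: list[float], threshold_s: float) -> float:
    """Percentage of requests that completed faster than the threshold."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    good = sum(1 for t in latencies_s if t < threshold_s)
    return 100 * good / len(latencies_s)


def meets_slo(latencies_s: list[float], threshold_s: float, slo_percent: float) -> bool:
    """True when the measured compliance is at or above the SLO target."""
    return sli_compliance(latencies_s, threshold_s) >= slo_percent


# Made-up samples: 18 retrievals at 1.2 s and 2 slow ones above 2 s.
samples = [1.2] * 18 + [2.6, 3.1]

print(sli_compliance(samples, threshold_s=2.0))              # share of fast requests
print(meets_slo(samples, threshold_s=2.0, slo_percent=85))   # relaxed target
print(meets_slo(samples, threshold_s=2.0, slo_percent=99))   # strict target
```

With these samples the relaxed target is met while the strict one is not, which is the situation where a 99% promise would burn budget without making users any happier.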
In addition to this, having telemetry and observability in place is very important. Without tracking these indicators, you will not be able to measure end user experience against SLO thresholds. This also gives you a sense of the other dependent factors and how their performance can affect the overall performance of a feature or the application in general.
Defining SLOs is a journey and not a destination. You should constantly refine your SLOs because many factors change with time, such as the user base, the size of your app and user expectations. Hence your SLOs should be defined mainly to achieve user satisfaction.
Over the years of setting up SLOs, I have repeatedly come up against the challenge of dealing with False Positives. No matter how efficient or accurate they are, monitoring tools will sometimes flag an event as an issue even though no SLO was violated, thus triggering a false positive. So keep in mind that building an efficient, battle-tested and trustworthy platform takes time.
In the early days, I noticed teams getting a lot of false positives, which eat into the Error Budget. I always yearned for a feature that would let me easily mark events as false positives and get those precious minutes back into the Error Budget. This helps in practicing observability with actionable data.
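The idea of claiming budget back can be pictured with a small sketch. The `Incident` class and `remaining_budget` function below are hypothetical and do not reflect how SLO Tracker is implemented; they only show that an incident flagged as a false positive stops counting against the budget.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One SLO violation alert (hypothetical model for illustration)."""
    source: str
    downtime_minutes: float
    false_positive: bool = False


def remaining_budget(budget_minutes: float, incidents: list[Incident]) -> float:
    """Budget left after subtracting downtime from genuine incidents only."""
    spent = sum(i.downtime_minutes for i in incidents if not i.false_positive)
    return budget_minutes - spent


incidents = [
    Incident("Prometheus", 10.0),
    Incident("Pingdom", 5.0),
    Incident("New Relic", 12.0),
]
budget = 52.56  # yearly budget for a 99.99% SLO, in minutes

before = remaining_budget(budget, incidents)
incidents[1].false_positive = True  # the Pingdom alert turned out to be noise
after = remaining_budget(budget, incidents)

print(round(before, 2), round(after, 2))
```

Marking the single 5-minute alert as a false positive returns exactly those 5 minutes to the budget.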
Another basic challenge engineers face is tracking all the defined SLIs. Since SLOs are monitored by multiple tools in the observability stack, teams that do not maintain a unified dashboard to accurately track the error budget end up oblivious to the error budget burn rate.
Thus a single source of truth, with multiple SLOs (across all services) tracked in one place, will ensure greater reliability. In most cases, services depend on one another, so outages are inevitable. The aim here is not just to 'not fail'. Instead, it is about failing measurably, with enough insight to fix the problem and ensure it does not happen again.
The challenges can be summarized as:
- Lack of a centralized dashboard for tracking SLIs (from multiple alert sources)
- Too many ‘False Positives’ eating into the error budget
- Short retention period of metrics stored in Prometheus (or other monitoring tools)
Tackling these challenges started off as a hobby and became my passion. That is how this open-source project came into existence.
As someone who has painfully experienced the challenges of SLO monitoring, I built this open-source project, “SLO Tracker”, as a simplified means to track the defined SLOs, Error Budgets, and Error Budget burn rate with intuitive graphs and visualizations. The dashboard is easy to set up and makes it simple to aggregate SLI metrics coming in from different sources.
You first need to set up your target SLOs. The Error Budget is calculated and allocated based on them. The SLO Tracker currently:
- Provides a unified dashboard for all the SLOs that have been set up, in turn giving insights into the SLIs being tracked
- Gives you a clear visualisation of the Error Budget and alerts you when the Error Budget burn rate threshold is breached
- Supports Webhook integrations with various observability tools (Prometheus, Pingdom, New Relic). Whenever an alert is received from these tools, the tracker recalculates the allocated Error Budget and deducts the downtime from it
- Provides the ability to claim your falsely spent Error Budget back by marking erroneous SLO violation alerts as False Positives
- Supports manual alert creation from the web app when a violation is not caught, either because your monitoring tool missed it for some reason or because your monitoring tool is not integrated with SLO Tracker
- Displays basic Analytics for SLO violation distribution (SLI distribution graph)
- Is easy to set up and lightweight, since it only stores and computes what matters (SLO violation alerts) and not the bulk of the data (every single metric)
Setting it up takes only a few steps:
- A Docker Compose file is already part of the project repo. You can bring all the components up with it.
- Once all the components are up, users can start adding SLOs from the frontend.
- The "Alert Sources" button lists the webhook URLs of all supported integrations. Users can add these webhook URLs to their respective monitoring tools.
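To make the burn rate threshold mentioned in the feature list a little more concrete, here is a minimal sketch under one common definition: the fraction of the budget already spent divided by the fraction of the tracking window already elapsed. This definition, the function names and the default threshold of 1.0 are assumptions for illustration; the post does not specify how SLO Tracker computes or thresholds its burn rate.

```python
def burn_rate(spent_minutes: float, budget_minutes: float,
              elapsed_days: float, window_days: float) -> float:
    """Budget consumption relative to elapsed time (assumed definition).

    A value of 1.0 means the budget would run out exactly at the end of
    the window; values above 1.0 mean it will run out earlier.
    """
    if budget_minutes <= 0 or elapsed_days <= 0 or window_days <= 0:
        raise ValueError("budget, elapsed time and window must be positive")
    budget_used = spent_minutes / budget_minutes
    time_used = elapsed_days / window_days
    return budget_used / time_used


def should_alert(rate: float, threshold: float = 1.0) -> bool:
    """True when the burn rate exceeds the chosen (hypothetical) threshold."""
    return rate > threshold


# A quarter of a 52.56-minute yearly budget spent after a fifth of the year.
rate = burn_rate(spent_minutes=13.14, budget_minutes=52.56,
                 elapsed_days=73, window_days=365)
print(round(rate, 2), should_alert(rate))
```

In this example the budget is being spent faster than time is passing, so an alert at a threshold of 1.0 would fire well before the budget is actually exhausted.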
I hope this blog helped you understand the pain points of SLO and Error Budget tracking. In keeping with the SRE ideology of automating as many ops tasks as possible, we built this SLO Tracker.
While this started off as a tool for internal use, we have now made it open-source so that everyone can use it, suggest improvements, submit code patches or contribute in any other way that makes it a better tool. Let’s make the path to reliability a smoother ride for everyone :)