<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Roshan shetty</title>
    <description>The latest articles on DEV Community by Roshan shetty (@roshan8).</description>
    <link>https://dev.to/roshan8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F312543%2F7d287c90-7dd1-43f5-837e-e91b286e8265.jpeg</url>
      <title>DEV Community: Roshan shetty</title>
      <link>https://dev.to/roshan8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/roshan8"/>
    <language>en</language>
    <item>
      <title>Introducing our open source SLO Tracker - A simple tool to track SLOs and Error Budget</title>
      <dc:creator>Roshan shetty</dc:creator>
      <pubDate>Wed, 08 Sep 2021 11:10:53 +0000</pubDate>
      <link>https://dev.to/squadcast/introducing-our-open-source-slo-tracker-a-simple-tool-to-track-slos-and-error-budget-4dp</link>
      <guid>https://dev.to/squadcast/introducing-our-open-source-slo-tracker-a-simple-tool-to-track-slos-and-error-budget-4dp</guid>
      <description>&lt;p&gt;&lt;em&gt;One of the tools we use internally at Squadcast for SLO and Error Budget tracking is now open-source. In keeping up with the SRE ideology of automating as many ops tasks as possible, we built this &lt;a href="https://slotracker.com/" rel="noopener noreferrer"&gt;SLO Tracker&lt;/a&gt;. We made this open-source so that the SRE community can also use it too. Looking forward to get your feedback, suggestions and patches :)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction to SLO tracking&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A central tenet of a strong SRE culture is the responsible management of Error Budgets. However, you can only calculate error budgets after establishing the expected service SLOs in agreement with all the relevant stakeholders.&lt;/p&gt;

&lt;p&gt;After defining organization-wide SLOs and the SLIs that measure them, calculating Error Budgets is just a numbers game. In short, these metrics are the foundation of a strong SRE culture, and I cannot stress enough how they promote accountability, trust and timely innovation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are SLOs and SLIs?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s understand this with an example. If your Service Level Indicator (SLI) is “&lt;strong&gt;xyz is true&lt;/strong&gt;”, then the corresponding Service Level Objective (SLO), an organization-level target, reads “&lt;strong&gt;xyz is true for a % of time&lt;/strong&gt;”, and the Service Level Agreement (SLA), meant for external/end users, is a legal contract that says “&lt;strong&gt;if xyz is not true for a % of time, then so and so will be compensated&lt;/strong&gt;”.&lt;/p&gt;

&lt;p&gt;An Error Budget lets you track downtime in real time via a burn rate. It is calculated as “1 - (Service Level Objective)”. So, a yearly SLO of 99.99% means it is acceptable for the service to be down for no more than 52.56 minutes in a year.&lt;/p&gt;
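&lt;p&gt;As an illustrative check of the arithmetic above (a standalone sketch, not part of SLO Tracker itself), the downtime permitted by a given SLO can be computed directly:&lt;/p&gt;

```python
# Error budget = 1 - SLO, expressed here as allowed downtime over a period.
def allowed_downtime_minutes(slo, period_minutes=365 * 24 * 60):
    """Return the downtime (in minutes) permitted by `slo` over the period."""
    return (1 - slo) * period_minutes

# A 99.99% yearly SLO permits roughly 52.56 minutes of downtime per year.
print(round(allowed_downtime_minutes(0.9999), 2))  # prints 52.56
```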

&lt;p&gt;The development team can spend their Error Budget however they feel is right - either by preventing or by fixing system instabilities.&lt;/p&gt;

&lt;p&gt;Ensuring service uptime is just one of the SRE objectives aimed at user satisfaction. A few other basic indicators concerning end-user requirements could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;App load time should be less than 3 seconds,&lt;/li&gt;
&lt;li&gt;Load times for every feature in the app should be less than 3 seconds,&lt;/li&gt;
&lt;li&gt;Less buggy features rolled out - not more than 2 bugs reported by users in a span of 20 days,&lt;/li&gt;
&lt;li&gt;‘Update time’ for data inputs to reflect should be less than 4 seconds,&lt;/li&gt;
&lt;li&gt;Retrieval of data within the app should take less than 2 seconds, to name a few.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Put simply, there needs to be a balance between what is acceptable to the end user and what is actually deliverable given the effort and budget required. Understanding where end users can tolerate a compromise in experience is key; based on that, identifying the right target thresholds for the chosen indicators becomes easier.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Impact of SLOs on organizational SLAs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The ideal way to start off is by doing just enough to reduce the number of complaints raised by the end user for a particular feature in the app. For example, when a user is trying to retrieve a huge data set from the app, they would be ready to accept a slight tolerance/ delay. In such a case, promising a 99% SLO for this indicator is both unnecessary and unrealistic. A more sensible target would be around 85% SLO. Even after satisfying this threshold, if users continue complaining, then the indicators and objectives along with their thresholds can be revisited.&lt;/p&gt;

&lt;p&gt;In addition to this, having telemetry and &lt;a href="https://www.squadcast.com/blog/using-observability-tools-to-set-slos-for-kubernetes-applications" rel="noopener noreferrer"&gt;observability&lt;/a&gt; in place is very important. Without tracking these indicators, you will not be able to measure end user experience against SLO thresholds. This also gives you a sense of the other dependent factors and how their performance can affect the overall performance of a feature or the application in general.&lt;/p&gt;

&lt;p&gt;Defining SLOs is a journey, not a destination. You should refine your SLOs constantly, because many factors change over time: the user base, the size of your app, user expectations, and so on. Ultimately, your SLOs should be defined to achieve user satisfaction.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Challenges in SLO monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Over the years of setting up SLOs, I have repeatedly run into the challenge of false positives. No matter how efficient or accurate, monitoring tools will sometimes flag an event as an issue even when no SLO has been violated, triggering a false positive. So keep in mind that building an efficient, battle-tested and genuinely insightful platform takes time.&lt;/p&gt;

&lt;p&gt;During the early days, I noticed teams getting a lot of false positives, which eat into the Error Budget. I always wanted a feature that would let me easily mark events as false positives and reclaim those precious minutes for the Error Budget. This helps in practicing observability with actionable data.&lt;/p&gt;

&lt;p&gt;Another common challenge engineers face is tracking all the defined SLIs. Since SLOs are monitored by multiple tools in the observability stack, the lack of a unified dashboard to accurately track the error budget leaves teams oblivious to the error budget burn rate.&lt;/p&gt;

&lt;p&gt;Thus, a single source of truth that tracks multiple SLOs (across all services) in one place ensures greater reliability. In most cases services depend on one another, so outages are inevitable. The aim is not just to 'not fail'; it is to fail measurably, with enough insight to fix the issue and ensure it does not happen again.&lt;/p&gt;

&lt;p&gt;The challenges can be summarized as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lack of a centralized dashboard for tracking SLIs (from multiple alert sources)&lt;/li&gt;
&lt;li&gt;Too many ‘False Positives’ eating into the error budget&lt;/li&gt;
&lt;li&gt;Short retention period of metrics stored in Prometheus (or other monitoring tools)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tackling these challenges started off as a hobby and became my passion. That is how this open-source project came into existence.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Introducing the SLO Tracker&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As someone who experienced the challenges of SLO monitoring firsthand, I built this open source project “&lt;a href="https://slotracker.com/" rel="noopener noreferrer"&gt;SLO tracker&lt;/a&gt;” as a simplified means to track the defined SLOs, Error Budgets, and Error Budget burn rate with intuitive graphs and visualizations. The dashboard is easy to set up and makes it simple to aggregate SLI metrics coming in from different sources.&lt;/p&gt;

&lt;p&gt;You first need to set up your target SLOs; the Error Budget is then calculated and allocated based on them. The SLO Tracker currently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides a unified dashboard for all the SLOs that have been set up, in turn giving insights into the SLIs being tracked&lt;/li&gt;
&lt;li&gt;Gives you a clear visualisation of the Error Budget and alerts you when the Error Budget burn-rate threshold is breached&lt;/li&gt;
&lt;li&gt;Supports Webhook integrations with various observability tools (Prometheus, Pingdom, New Relic); whenever an alert is received from these tools, the tracker re-calculates and deducts time from the allocated Error Budget&lt;/li&gt;
&lt;li&gt;Provides the ability to claim your falsely spent Error Budget back by marking erroneous SLO violation alerts as False Positives&lt;/li&gt;
&lt;li&gt;Supports manual alert creation from the web app when a violation is not captured:

&lt;ul&gt;
&lt;li&gt;Either because your monitoring tool missed it when it should have been caught,&lt;/li&gt;
&lt;li&gt;Or because your monitoring tool is not integrated with SLO Tracker&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Displays basic Analytics for SLO violation distribution (SLI distribution graph)&lt;/li&gt;

&lt;li&gt;Is easy to set up and lightweight, since it stores and computes only what matters (SLO violation alerts) rather than every single metric&lt;/li&gt;

&lt;/ul&gt;
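&lt;p&gt;To make the false-positive reclaim idea concrete, here is a minimal sketch (hypothetical names, not SLO Tracker's actual code) of an error budget that loses minutes on each violation alert and gets them back when an alert is marked as a false positive:&lt;/p&gt;

```python
class ErrorBudget:
    """Toy model of an error budget measured in minutes of allowed downtime."""

    def __init__(self, total_minutes):
        self.total_minutes = total_minutes
        self.spent_minutes = 0.0

    def record_violation(self, minutes):
        # Each SLO violation alert burns part of the budget.
        self.spent_minutes += minutes

    def mark_false_positive(self, minutes):
        # Reclaiming a false positive restores the wrongly spent minutes.
        self.spent_minutes = max(0.0, self.spent_minutes - minutes)

    @property
    def remaining_minutes(self):
        return self.total_minutes - self.spent_minutes


budget = ErrorBudget(total_minutes=52.56)  # yearly budget for a 99.99% SLO
budget.record_violation(10)                # a real outage
budget.record_violation(5)                 # later found to be a false positive
budget.mark_false_positive(5)              # reclaim those minutes
print(round(budget.remaining_minutes, 2))  # prints 42.56
```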

&lt;h3&gt;
  
  
  &lt;strong&gt;How to set this up?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A docker-compose file is part of the project &lt;a href="http://github.com/roshan8/slo-tracker" rel="noopener noreferrer"&gt;repo&lt;/a&gt;; use it to bring up all the components.&lt;/li&gt;
&lt;li&gt;Once all the components are up, users can start adding SLOs from the frontend.&lt;/li&gt;
&lt;li&gt;The "Alert Sources" button lists the webhook URLs of all supported integrations; add these URLs to the respective monitoring tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I hope this blog helped you understand the pain points of SLO and Error Budget tracking. In keeping with the SRE ideology of automating as many ops tasks as possible, we built this &lt;a href="https://slotracker.com/" rel="noopener noreferrer"&gt;SLO Tracker&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While this started off as a tool for internal use, we have now made it open-source for everyone to use, provide suggestions, code patches or contribute in any way that can make this a better tool. Let’s make the path to reliability a smoother ride for everyone :)&lt;/p&gt;

</description>
      <category>slo</category>
      <category>sre</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>Faster Incident Resolution with Context Rich Alerts</title>
      <dc:creator>Roshan shetty</dc:creator>
      <pubDate>Wed, 16 Jun 2021 12:16:33 +0000</pubDate>
      <link>https://dev.to/squadcast/faster-incident-resolution-with-context-rich-alerts-264l</link>
      <guid>https://dev.to/squadcast/faster-incident-resolution-with-context-rich-alerts-264l</guid>
      <description>&lt;p&gt;&lt;em&gt;Labelling your alert payloads although simple can significantly improve the time it takes for your team to respond to incidents. In this blog learn how Squadcast's auto-tagging feature can be a game changer by enabling intelligent labelling &amp;amp; routing of alerts to ultimately reduce your MTTR.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A frequent problem faced by on-call engineers when critical outages occur is pinpointing the exact point of failure. Even though modern monitoring tools and incident management platforms provide context around each alert, there is still room for improvement. A relatively simple solution is to add labels to your alert payloads.&lt;/p&gt;

&lt;p&gt;As an on-call engineer, you may have faced a situation where a major alert took a long time to triage, often because the alert payload was missing crucial information such as the hostname or cluster details in a Kubernetes setup.&lt;/p&gt;

&lt;p&gt;In this blog, let's understand how we can add labels to important information within the payload so as to reduce MTTR (Mean Time to Resolve).&lt;/p&gt;

&lt;p&gt;We’ll also explore how Prometheus Alert manager and Squadcast with &lt;a href="https://www.squadcast.com/effective-on-call-and-incident-response" rel="noopener noreferrer"&gt;Routing, and Tagging rules&lt;/a&gt; can ensure that Alert Payloads with specific labels are sent to the concerned engineers for faster remediation of issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; A Payload Label can be used to classify the Payload data and identify crucial information. While we cover a Kubernetes specific example in this blog, this can be done with other monitoring tools as well.&lt;/p&gt;

&lt;p&gt;The below screenshot is an example of a basic Payload (without any labels).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8683a0f614dec0bb0d8_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8683a0f614dec0bb0d8_1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an on-call engineer, you will need more details about the alert such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IP address / hostname&lt;/li&gt;
&lt;li&gt;Cluster identification, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this information is not available in the payload, you have to fetch the IP address manually to troubleshoot the issue.&lt;/p&gt;

&lt;p&gt;Your life would have been simpler if details such as the IP address, hostname, application name, severity level and environment name were included within the alert itself. You could even have ignored the alert if it came from a test/staging environment, since the payload would have carried environment-related labels.&lt;/p&gt;

&lt;p&gt;There is a relatively simple way to add labels to the payload using your preferred monitoring tool. In the following example, we will be using Prometheus Alert Manager to make context-rich alert payloads.&lt;/p&gt;

&lt;p&gt;Below is an example Kubernetes Deployment manifest with labels that supply this context. (Note that this is a Kubernetes manifest; the Prometheus alerting rule that carries these labels into the alert payload is shown further below.)&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Kubernetes Deployment manifest with labels&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;apiVersion:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;apps/v&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;kind:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Deployment&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;metadata:&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;pubsub&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;namespace:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;laddertruck&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;labels:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"laddertruck"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;language:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ruby"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;language-version:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3.0.0"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;framework:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rails"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;framework-version:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5.2.1"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;team:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"xyz"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;developed-by:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"diane"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;service-owner:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"john"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the various labels mentioned above.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to decide which labels to add to the payload?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The label names will depend on your technology stack and the on-call team you have in place. The on-call team should decide which labels to use, since they are the first responders when a critical outage occurs.&lt;/p&gt;

&lt;p&gt;The labels shown below are some of the common ones your team can use to get started with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;owner: This tag identifies the owner of the service.&lt;/li&gt;
&lt;li&gt;language: The programming language the service is written in.&lt;/li&gt;
&lt;li&gt;framework: The framework on which the service is built. This is vital if there are multiple services written in the same language but utilising different frameworks.&lt;/li&gt;
&lt;/ul&gt;
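&lt;p&gt;As a hedged illustration (the field names here are invented for the example, not a real monitoring tool's schema), a labelled alert payload might carry this context alongside the alert itself:&lt;/p&gt;

```python
# A hypothetical labelled alert payload; the keys mirror the starter labels above.
alert_payload = {
    "alertname": "ContainerMemoryUsage",
    "labels": {
        "owner": "john",        # who owns the service
        "language": "ruby",     # implementation language
        "framework": "rails",   # disambiguates same-language services
        "severity": "warning",
        "environment": "production",
    },
}

# On-call tooling can then triage without manual lookups:
print(alert_payload["labels"]["owner"])  # prints john
```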

&lt;p&gt;So now, let's set up an alerting rule in Prometheus (these rules are evaluated by Prometheus and routed through Alertmanager):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;alert:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ContainerMemoryUsage&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;expr:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(sum(container_memory_working_set_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;namespace=&lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;BY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(instance,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;name)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sum(container_spec_memory_limit_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;namespace=&lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;BY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(instance,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;name)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;for:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;m&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;labels:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;severity:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;warning&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;annotations:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;summary:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Container Memory usage (instance {{ $labels.instance }})"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;description:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Container Memory usage is above 90%&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt; VALUE = {{ $value }}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt; LABELS: {{ $labels }}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the rule above, we define the alert condition and attach a severity label to it.&lt;/p&gt;

&lt;p&gt;So now, when an alert is sent to Squadcast from Alertmanager, all the relevant information will be embedded in the payload, such as severity, deployment and other labels mentioned in the Prometheus configuration file. We can make use of Squadcast &lt;a href="https://support.squadcast.com/docs/alert-routing" rel="noopener noreferrer"&gt;routing rules&lt;/a&gt; to efficiently manage/route the incident to the concerned person/ team.&lt;/p&gt;

&lt;p&gt;Furthermore, we can use the annotations option in the alerting rule to include much more detailed information in the alert payload. In organisations that rely on playbooks/runbooks, on-call engineers can then start troubleshooting right away rather than searching for the relevant playbook when an incident occurs.&lt;/p&gt;
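&lt;p&gt;For example, a runbook link can be attached as an extra annotation in the alerting rule. This is a sketch extending the rule above; the URL is a hypothetical internal wiki address, not a real one:&lt;/p&gt;

```yaml
  annotations:
    summary: "Container Memory usage (instance {{ $labels.instance }})"
    # Hypothetical internal runbook link; replace with your own wiki/runbook URL.
    runbook_url: "https://wiki.example.com/runbooks/container-memory"
```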

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d87fae5b694e862fd933_2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d87fae5b694e862fd933_2.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;a href="https://www.google.com/url?q=https://prometheus.io/blog/2016/03/03/custom-alertmanager-templates/&amp;amp;sa=D&amp;amp;source=editors&amp;amp;ust=1623240723424000&amp;amp;usg=AOvVaw0b16lhyi_rIaXGusnpLrB0" rel="noopener noreferrer"&gt;Image Source&lt;/a&gt;



&lt;p&gt;As seen here, the alert notification from Prometheus contains a link to the internal runbook.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Configuring Squadcast with help of labels&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the screenshot below you can see how to configure Squadcast to route alerts based on the labels attached to the incident payload. Here the service which we are configuring is called “K8s Cluster Monitoring”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d89b3a0f612e520bb117_3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d89b3a0f612e520bb117_3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;‘Tagging Rules’&lt;/em&gt; are used to define the labels that will be processed by Squadcast. Once the ‘tags’ have been defined, &lt;em&gt;Routing Rules&lt;/em&gt; can be used to send alerts to the concerned person, team or escalation policy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8a9bc317f652a0ed20a_4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8a9bc317f652a0ed20a_4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the labels defined in the payload are now recognised as ‘tags’ in Squadcast. Once this step is complete we can define custom routing rules based on the ‘tags’.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8b9e095549693871964_5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8b9e095549693871964_5.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the screenshot above we have created three tags based on the following criteria: “service-owner”, “squad” and “deployed-by”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8d65aa22e849e2e823f_6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8d65aa22e849e2e823f_6.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that these tags have been defined, we can go on to create granular incident routing rules for the service. In the screenshot above we are creating a routing rule where, if the alert payload has ‘service-owner’ defined as ‘john’, the alert notification is sent directly to John.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8e357e7195452f6bd2b_7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8e357e7195452f6bd2b_7.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above, a routing rule is created for alert payloads that carry the ‘team’ label; the alert notification is routed to that specific team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8f3f0ff15bbebc21f80_8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d8f3f0ff15bbebc21f80_8.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this instance the alert will be routed to the person who deployed the feature (‘Diane’).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d903df40f4c4a465acbb_9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d903df40f4c4a465acbb_9.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So far we have seen alerts routed to specific individuals; however, it is also possible to route alerts to predefined &lt;a href="https://support.squadcast.com/docs/escalation-policies" rel="noopener noreferrer"&gt;escalation policies&lt;/a&gt;. The screenshot above shows an example of this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d9172ff9c9ab6b2888fe_10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d9172ff9c9ab6b2888fe_10.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Combine multiple tags with boolean operators (‘AND’) to make routing rules as specific as possible and to cover as many scenarios as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d9172ff9c9ab6b2888fe_10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d9172ff9c9ab6b2888fe_10.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Squadcast, Routing Rules are executed top-down: once a rule matches, the remaining rules are automatically ignored. The &lt;a href="https://support.squadcast.com/docs/de-duplication-rules#faqs" rel="noopener noreferrer"&gt;Execution Priority&lt;/a&gt; feature lets you define the order in which the rules are evaluated.&lt;/p&gt;
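&lt;p&gt;The top-down, first-match behaviour can be sketched as follows (an illustrative model with invented names, not Squadcast's actual routing engine):&lt;/p&gt;

```python
def route(alert_tags, rules, default="default-escalation-policy"):
    """Evaluate routing rules top-down; the first rule whose conditions all
    match wins, and the remaining rules are ignored."""
    for conditions, target in rules:
        if all(alert_tags.get(key) == value for key, value in conditions.items()):
            return target
    return default


rules = [
    ({"service-owner": "john"}, "john"),  # rule 1: route to the service owner
    ({"team": "xyz"}, "team-xyz"),        # rule 2: never reached if rule 1 matches
]

# Both rules match this alert, but only the first one fires.
print(route({"service-owner": "john", "team": "xyz"}, rules))  # prints john
```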

&lt;h3&gt;
  
  
  &lt;strong&gt;Other Benefits of context-rich alerts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Having contextual information in each payload is a great help during &lt;a href="https://www.squadcast.com/post-incident-review" rel="noopener noreferrer"&gt;post-mortems / reviews&lt;/a&gt; after major outages, since detailed incident timelines are created automatically. While the concept of context-rich alert payloads may seem simple, in the long term it can help improve the reliability of your system.&lt;/p&gt;

&lt;p&gt;Below we can see an incident in Squadcast with its associated labels. With this rule-based auto-tagging system, you can define customised tags based on incident payloads that get assigned automatically when incidents are triggered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d93592a9f6187bfc8df4_12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c9200c49b1194323aff7304%2F60c0d93592a9f6187bfc8df4_12.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In many cases it is not feasible to reduce the ad-hoc complexity of your existing architecture. This is where the combination of context-rich alerts and intelligent routing helps to drastically reduce MTTA and MTTR.&lt;/p&gt;

&lt;p&gt;The example provided above is just the tip of the iceberg - you can create custom labels and routing rules as well. As your infrastructure scales up with new users and dependencies, your MTTR will still be within acceptable limits thanks to better labeling and routing.&lt;/p&gt;

&lt;p&gt;What do you struggle with as a DevOps/SRE? Do you have ideas on how incident response could be done better in your organization? We would be thrilled to hear from you! Leave us a comment or reach out over a DM via &lt;a href="https://twitter.com/squadcasthq?lang=en" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; and let us know your thoughts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.squadcast.com/" rel="noopener noreferrer"&gt;Squadcast&lt;/a&gt; is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.squadcast.com/register/" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F5c51758c58939b30a6fd3d73%2F5e16013f80ad26b00925d758_image--5--1.png"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bestpractices</category>
      <category>incidentresponse</category>
      <category>oncall</category>
    </item>
    <item>
      <title>How to track your product's SLO/ErrorBudget: A simple tool to keep track of things!</title>
      <dc:creator>Roshan shetty</dc:creator>
      <pubDate>Mon, 12 Apr 2021 07:19:31 +0000</pubDate>
      <link>https://dev.to/roshan8/how-to-track-your-product-slo-errorbudget-5a4a</link>
      <guid>https://dev.to/roshan8/how-to-track-your-product-slo-errorbudget-5a4a</guid>
      <description>&lt;p&gt;Today, Most of the organization track their product SLO’s to avoid being liable for breach of SLAs (Service level agreements). In case of any SLO violation, They will be under obligation to pay something in return for breach of contract. Once the SLO for their product has been defined, A corresponding error budget will be calculated based on that number. For example, If 99.99% is the SLO, then the error budget will be 52.56 mins in a year. That’s the amount of downtime that the product may have in a year without breaching the SLO.&lt;/p&gt;

&lt;p&gt;Once companies agree on the SLO, they need to pick the most relevant SLIs (Service Level Indicators). Any violation of these SLIs is considered downtime, and the duration of the downtime is deducted from the error budget. For example, a payment gateway product might have the following SLIs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p95 request latency&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;li&gt;Payment failures, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Additional reading:
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://sre.google/workbook/implementing-slos/"&gt;https://sre.google/workbook/implementing-slos/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://sre.google/workbook/error-budget-policy/"&gt;https://sre.google/workbook/error-budget-policy/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why is it challenging for many companies to track error budgets at the moment?
&lt;/h2&gt;

&lt;p&gt;Usually, organizations use a mix of tools to monitor these SLIs (for example, latency-related SLIs are generally tracked in APMs such as New Relic, while other SLIs are tracked in monitoring tools such as Prometheus or Datadog). That makes it hard to keep track of the error budget in one centralized location.&lt;/p&gt;

&lt;p&gt;Sometimes companies have a very short retention period (&amp;lt;6 months) for their metrics in Prometheus. Retaining metrics for a longer period may require setting up Thanos/Cortex and federation rules, and performing capacity planning for the metrics storage.&lt;/p&gt;

&lt;p&gt;Next comes the problem of false positives. Even if you are tracking something in Prometheus, it’s hard to flag an event as a false positive when the incident is not a genuine SLO violation. Building an efficient and battle-tested monitoring platform takes time; initially, teams might end up with a lot of false positives, and you may want to mark some old violations as false positives to get those minutes back into your error budget.&lt;/p&gt;
&lt;h2&gt;
  
  
  What does the SLO tracker do?
&lt;/h2&gt;

&lt;p&gt;This error budget tracker seeks to provide a simple and effective way to keep track of the error budget and burn rate without the hassle of configuring and aggregating multiple data sources.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users first set up their target SLO, and the error budget is allocated based on it.&lt;/li&gt;
&lt;li&gt;It currently supports webhook integrations with a few monitoring tools (Prometheus, Pingdom, and New Relic); whenever it receives an incident from one of them, it deducts the incident’s duration from the error budget.&lt;/li&gt;
&lt;li&gt;If a violation is not caught by your monitoring tool, or if this tool doesn’t integrate with your monitoring tool, the incident can be reported manually through the user interface.&lt;/li&gt;
&lt;li&gt;Provides some analytics into the SLO violation distribution (SLI distribution graph).&lt;/li&gt;
&lt;li&gt;Doesn’t require much storage space, since it stores only the violations, not every metric.&lt;/li&gt;
&lt;/ul&gt;
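&lt;p&gt;The bookkeeping described above (deducting incident time from the budget, refunding false positives) can be sketched roughly as follows; the class and method names are illustrative, not the project’s actual API:&lt;/p&gt;

```python
from dataclasses import dataclass, field

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class ErrorBudget:
    """Illustrative sketch of error-budget bookkeeping (not the project's real API)."""
    slo_percent: float
    spent_minutes: float = 0.0
    incidents: list = field(default_factory=list)

    @property
    def total_minutes(self) -> float:
        # Total allowed downtime for the year at the configured SLO.
        return MINUTES_PER_YEAR * (1 - self.slo_percent / 100.0)

    @property
    def remaining_minutes(self) -> float:
        return self.total_minutes - self.spent_minutes

    def record_incident(self, sli: str, duration_minutes: float) -> None:
        # An incident (reported via webhook or the UI) burns budget.
        self.incidents.append((sli, duration_minutes))
        self.spent_minutes += duration_minutes

    def mark_false_positive(self, index: int) -> None:
        # Refund the duration of a wrongly recorded violation.
        sli, duration = self.incidents[index]
        self.spent_minutes -= duration


budget = ErrorBudget(slo_percent=99.99)
budget.record_incident("p95-latency", 10)
budget.record_incident("error-rate", 5)
budget.mark_false_positive(1)  # that one wasn't a genuine violation
print(round(budget.remaining_minutes, 2))  # 42.56
```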
&lt;h2&gt;
  
  
  How to set this up?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Clone the &lt;a href="https://github.com/roshan8/slo-tracker"&gt;repo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The repo already has a docker-compose file, so just run &lt;code&gt;docker-compose up -d&lt;/code&gt;, and your setup is done!&lt;/li&gt;
&lt;li&gt;Default creds are &lt;code&gt;admin:admin&lt;/code&gt;. This can be changed in &lt;code&gt;docker-compose.yaml&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Now set an SLO target in the UI.&lt;/li&gt;
&lt;li&gt;To integrate this tool with your monitoring tools, you can use the following webhook URLs.

&lt;ul&gt;
&lt;li&gt;For Prometheus: serverip:8080/webhook/prometheus&lt;/li&gt;
&lt;li&gt;For New Relic: serverip:8080/webhook/newrelic&lt;/li&gt;
&lt;li&gt;For Pingdom: serverip:8080/webhook/pingdom&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Now set up rules to monitor SLIs in your monitoring tool (let’s see how this can be done in Prometheus). Here is a Prometheus alerting rule for an example SLI: &lt;code&gt;Nginx p99 latency&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  - alert: NginxLatencyHigh
    &lt;span class="nb"&gt;expr&lt;/span&gt;: histogram_quantile&lt;span class="o"&gt;(&lt;/span&gt;0.99, &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;rate&lt;span class="o"&gt;(&lt;/span&gt;nginx_http_request_duration_seconds_bucket[2m]&lt;span class="o"&gt;))&lt;/span&gt; by &lt;span class="o"&gt;(&lt;/span&gt;host, node&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 3
    &lt;span class="k"&gt;for&lt;/span&gt;: 2m
    labels:
      severity: critical
    annotations:
      summary: Nginx latency high &lt;span class="o"&gt;(&lt;/span&gt;instance &lt;span class="o"&gt;)&lt;/span&gt;
      description: Nginx p99 latency is higher than 3 seconds&lt;span class="se"&gt;\n&lt;/span&gt;  VALUE &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="se"&gt;\n&lt;/span&gt;  LABELS: 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Alertmanager routing based on the labels set in the alerting rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
          global:
            resolve_timeout: 10m
          route:
            routes:
            - receiver: blackhole
            - receiver: slo-tracker
              group_wait: 10s
              match_re:
                severity: critical
              &lt;span class="k"&gt;continue&lt;/span&gt;: &lt;span class="nb"&gt;true
          &lt;/span&gt;receivers:
          - name: ‘slo-tracker’
            webhook_config: 
              url: 
&lt;span class="s1"&gt;'http://ENTERIP:8080/webhook/prometheus'&lt;/span&gt;
              send_resolved: &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use different labels in your alerting rules if you don’t want to route alerts based on the severity label.
&lt;/li&gt;
&lt;/ul&gt;
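&lt;p&gt;To sanity-check the Prometheus integration without waiting for a real alert, you can also post a payload to the webhook by hand. The sketch below assumes the endpoint accepts the standard Alertmanager webhook JSON (the field names are the stock Alertmanager format); the base URL is a placeholder you would replace with your server address:&lt;/p&gt;

```python
import json
from urllib import request

# Minimal payload in the standard Alertmanager webhook format.
payload = {
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "NginxLatencyHigh", "severity": "critical"},
        "annotations": {"summary": "Nginx latency high"},
        "startsAt": "2021-04-12T07:00:00Z",
    }],
}


def post_alert(base_url: str) -> None:
    """Send the test payload to the SLO tracker's Prometheus webhook.

    base_url is a placeholder, e.g. "http://ENTERIP:8080".
    """
    req = request.Request(
        f"{base_url}/webhook/prometheus",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```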

&lt;h2&gt;
  
  
  What’s next:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add a few more monitoring tool integrations&lt;/li&gt;
&lt;li&gt;Track SLOs for multiple products&lt;/li&gt;
&lt;li&gt;Add more graphs for analytics&lt;/li&gt;
&lt;li&gt;Better visualization tools to pinpoint problematic services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;This project is open-source. Feel free to open a PR or raise an issue :)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you would like to see the dashboard then please check &lt;a href="http://35.232.125.243:3000/"&gt;this&lt;/a&gt; out!&lt;/strong&gt;&lt;br&gt;
(&lt;code&gt;admin:admin&lt;/code&gt; are the creds. Also, please use a laptop to open this webapp; it's not mobile-friendly yet)&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>showdev</category>
      <category>slo</category>
    </item>
  </channel>
</rss>
