DEV Community: MerlynMShelley

How Squadcast Benefits On-call Engineers - Part 1

MerlynMShelley — Tue, 24 Aug 2021 12:20:32 +0000

It is difficult to stay completely reliable in an always-on world. So it's very important to choose the right Incident Management solution that can solve your problems. In this blog, we have highlighted the benefits of Squadcast and why you should adopt it.

“Being on-call sucks!”

Often incident response teams use this phrase when talking about their on-call experiences. Despite using best practices for managing infrastructure, incidents do occur from time to time.

In order to avoid delays in responding to incidents and prevent being overwhelmed by on-call notifications, you should find a solution that helps in resolving incidents efficiently. Squadcast is one such platform that has helped numerous teams respond to critical incidents quicker than ever before.

In fact, Squadcast is packed with easy-to-use features that help engineers make informed decisions. To start with, let’s look at the key benefits that help alleviate the stress of being on-call.

Benefits of Squadcast

1. Swift Incident Response

If your monitoring tool sends alerts that you cannot understand at first glance, your recovery process will be delayed. This increases both MTTA (mean time to acknowledge) and MTTR (mean time to resolve). With Squadcast, you can add context to incidents, making them easier to understand and act on. Tagging and Routing are two features that help you achieve this.

a. Incident Tagging

Responding to incidents becomes much easier when the engineers have enough context regarding the incident. This can be achieved by tagging incidents with relevant information like priority, severity, or alert type within the incoming alert.

This rule-based, auto-tagging system can be established by defining rules on payloads associated with incidents. These tags also help you search for specific incidents and filter a group of incidents on the analytics and incident list pages.

You have the option to tag each incident with a personalized tagging expression as shown in the screenshot below. As soon as an incident is triggered, tags can be automatically attached to it. It also provides insight into the severity levels of incidents. That way, incidents can be better understood.

For example, from the below screenshot, with Datadog's payload, you can define rules in Squadcast to categorize an incident as a high or low priority. This allows you to respond to alerts appropriately.
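The rule-based tagging described above can be illustrated with a minimal Python sketch. It is not Squadcast's tagging expression syntax: the payload fields (`alert_type`, `title`), the rule format, and the `tag_incident` helper are all hypothetical and only show how rules over an alert payload can yield tags.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical rule format: (predicate over the alert payload, tags to attach).
Rule = Tuple[Callable[[Dict[str, Any]], bool], Dict[str, str]]

# Assumed payload fields; a real monitoring payload will look different.
RULES: List[Rule] = [
    (lambda p: p.get("alert_type") == "error", {"priority": "high"}),
    (lambda p: p.get("alert_type") == "warning", {"priority": "low"}),
    (lambda p: "database" in p.get("title", "").lower(), {"component": "database"}),
]


def tag_incident(payload: Dict[str, Any], rules: List[Rule] = RULES) -> Dict[str, str]:
    """Return the tags of every rule whose predicate matches the payload."""
    tags: Dict[str, str] = {}
    for matches, rule_tags in rules:
        if matches(payload):
            tags.update(rule_tags)
    return tags


if __name__ == "__main__":
    alert = {"alert_type": "error", "title": "Database latency high"}
    print(tag_incident(alert))
```

Because every matching rule contributes its tags, one incident can carry both a priority tag and a component tag.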

Incident Tagging

b. Incident Routing

Usually incident management solutions offer routing as a basic feature that helps teams to direct alerts internally. But in Squadcast, you can customize alert routing based on the rules you define and you will be able to personalize your notification mechanism as well. That way you won't be bothered with irrelevant alerts in your communications channels.

For instance, once the incidents are tagged, you can filter and sort them based on their priority or severity. Then you can route relevant incidents to the right users for resolution. In this case, tagging can make it easier to route low priority incidents to Level-1 support teams and higher priority incidents to Level-2 support teams.

This helps in routing the right alerts to the right responder(s) based on the tags they contain. In this way, it helps greatly in reducing the MTTR of an incident.
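As a rough sketch of tag-based routing, the snippet below sends low-priority incidents to a Level-1 team and high-priority incidents to a Level-2 team, as in the example above. The team names, the `priority` tag, and the `route_incident` function are assumptions for illustration, not part of Squadcast's API.

```python
from typing import Dict

# Hypothetical mapping from a priority tag to a responder group.
ROUTES: Dict[str, str] = {
    "low": "level-1-support",
    "high": "level-2-support",
}


def route_incident(tags: Dict[str, str], default_team: str = "level-1-support") -> str:
    """Pick the responder group for an incident based on its priority tag."""
    return ROUTES.get(tags.get("priority", ""), default_team)


if __name__ == "__main__":
    print(route_incident({"priority": "high"}))
    print(route_incident({"priority": "low"}))
```

Untagged incidents fall back to the default team here; whether that is the right fallback is a policy decision for each team.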

Incident Routing

2. Reduce Alert Noise, Alert Fatigue, and Operational Toil Drastically

Alert noise occurs when you are alerted for both critical and non-critical incidents. This is dangerous because it can lead to alert fatigue and keep you from responding to critical incidents. Squadcast offers several features to address this problem.

This blog will explain how to optimize your alerts with Squadcast.

a. Incident deduplication (Event Aggregation)

In the case of Alert noise, you can deduplicate events arising from the same source or multiple alert sources (dependent services). As you can see from the below screenshot, by enabling the checkbox at the bottom, you can select other dependent services for the deduplication rule. Now the same incident arising from different alert sources will be grouped into one.

Deduplication Rules

When the first alert for such an incident is triggered, the subsequent alerts for the same incident will be grouped to the original one. But it will be in a ‘triggered’ state across the incident dashboard.
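The grouping behaviour can be sketched in a few lines of Python. This is an illustration only: the `service` and `check` fields used as the deduplication key are assumptions, and real deduplication rules may match on other parts of the payload.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Incident:
    key: str
    status: str = "triggered"
    alerts: List[Dict[str, Any]] = field(default_factory=list)


def deduplicate(
    alerts: Sequence[Dict[str, Any]],
    key_fields: Sequence[str] = ("service", "check"),  # assumed key fields
) -> List[Incident]:
    """Group alerts sharing the same key into one incident, in arrival order."""
    incidents: Dict[str, Incident] = {}
    for alert in alerts:
        key = "|".join(str(alert.get(name, "")) for name in key_fields)
        if key not in incidents:
            # The first alert with this key opens the incident.
            incidents[key] = Incident(key=key)
        # Later alerts with the same key are attached to the original incident.
        incidents[key].alerts.append(alert)
    return list(incidents.values())


if __name__ == "__main__":
    stream = [
        {"service": "api", "check": "latency", "value": 900},
        {"service": "api", "check": "latency", "value": 950},
        {"service": "db", "check": "disk", "value": 91},
    ]
    for incident in deduplicate(stream):
        print(incident.key, incident.status, len(incident.alerts))
```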

Additionally, you can turn on alert suppression, which will automatically suppress alerts that are not critical. This means you won’t be able to take any action on the alerts since they're in the ‘suppressed’ state on the incident dashboard. You need to be more careful in defining suppression rules because you will no longer be able to act on them.

Alert Suppression

b. Minimize Operational Toil

Toil is a repetitive set of tasks that leaves you bored, burned out, and exhausted. In incident management, it reduces productivity, hurts employee morale, and increases attrition.

Squadcast can significantly reduce an SRE's burden with features such as automatic suppression, deduplication, rule-based routing, escalation policies, and on-call schedules. We have also come up with this unique onboarding checklist that will assist you in getting organized for the on-call process. These personalized options would relieve you from tiring on-call chores.

3. Customize On-Call Operations with Ease

On-call coordination is key to resolving an incident.

Every member in the incident management team must be informed of on-call schedules. Each team member must know the name and phone number of those who are on call at any given time.

Likewise, you need to know who will be on call and at what time. And, if you do not acknowledge an incident on time, it should be escalated to the next on-call engineer.

If this process is not set up properly, it will lead to increased MTTA and MTTR. That’s why on-call schedules and escalation plans play an important role in a smooth incident response process. Squadcast offers 12 layers of personalized escalation policies to alert on-call engineers. That way you will not miss important alerts, and your engineers will not be dismayed either.

a. On-call Escalation Policy

After an incident is triggered, the alerts will be routed based on the defined escalation policy. If a user is not available at that time, the escalation requests are forwarded to the next available user in the escalation policy. This is also called on-call scheduling or on-call rotation.

You will be notified via text messages, emails, and phone calls with our unique 12-layer setup of On-call Escalations. That is, through the platform, the on-call manager can define 12 levels of escalation rules to notify the appropriate user group about an incident. The on-call escalation policy is repeated three times until an incident is acknowledged.
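The escalation flow can be sketched as follows. The sketch assumes that "repeated three times" means three passes over the policy in total and that each level is a list of user names; both are assumptions, and the function names are hypothetical rather than part of Squadcast.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

MAX_LEVELS = 12  # the article describes up to 12 escalation levels

Step = Tuple[int, int, List[str]]  # (pass number, level number, responders)


def escalation_sequence(levels: Sequence[List[str]], repeats: int = 3) -> Iterator[Step]:
    """Yield the notification steps in order, one pass after another."""
    if len(levels) > MAX_LEVELS:
        raise ValueError(f"at most {MAX_LEVELS} escalation levels are supported")
    for pass_no in range(1, repeats + 1):
        for level_no, responders in enumerate(levels, start=1):
            yield pass_no, level_no, responders


def notify_until_acknowledged(
    levels: Sequence[List[str]],
    is_acknowledged: Callable[[], bool],
    repeats: int = 3,
) -> List[Step]:
    """Walk the policy and stop as soon as the incident is acknowledged."""
    notified: List[Step] = []
    for step in escalation_sequence(levels, repeats):
        if is_acknowledged():
            break
        notified.append(step)  # a real system would send SMS, email, or a call here
    return notified


if __name__ == "__main__":
    policy = [["alice"], ["bob", "carol"]]
    for step in notify_until_acknowledged(policy, lambda: False):
        print(step)
```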

If you are an on-call engineer, you can personalize the notification settings according to your preferences. You can reassign an incident to another user group when necessary. Plus, you can easily create and investigate incidents within Slack.

Reassigning an Incident

b. Incident Notes

The platform also comes with an ‘Incident Notes’ feature that allows you to notify a user (on-call engineer) by mentioning their name with ‘@’ in the incident notes panel. So, it fosters ownership of incidents within the team.

c. Incident Postmortem

The platform has incident postmortem templates in which you can edit and add details about the incident management process with ease. Postmortems play a crucial role in root cause analysis, and the pre-defined templates help the team document its findings.

Incident Postmortem Template

In the above screenshot, you'll see how to edit a predefined incident postmortem template, and the options to update it or download it in PDF/MD format. This increases transparency within an organization: all employees can stay informed about the progress of an incident and the steps taken to resolve it.

Final Thoughts

Squadcast, with its customizable features such as contextual tagging, routing, suppression and more, makes it easy for incident management teams to manage and resolve incidents quickly.

By using this intuitive platform, customers have experienced a significant reduction in their MTTR and MTTA. It eliminates alert noise, fatigue, and operational toil, thereby increasing the productivity of the team as a whole.

These are just a handful of benefits discussed in this article. We have a lot of features stacking up in the product roadmap that we'll be sharing with you in the upcoming blogs. So stay tuned! Get started with Squadcast for free and tell us what you think.

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

Five Ways Developers Can Help SREs

MerlynMShelley — Tue, 17 Aug 2021 13:04:50 +0000

Reliability is a team game. The more Developers and SREs collaborate, the greater the success of the product. In this blog, we have listed five best practices that developers can adopt to make the SRE's life easier.

It is not easy to be a site reliability engineer. Monitoring system infrastructure and aligning it with the key reliability metrics is a daunting task. A software engineer's job, by contrast, is to deliver high-quality software.

Relationships between software engineers and site reliability engineers can sometimes be tricky. To begin with, developers are generally assigned to write code that goes into production. Then, there are Site Reliability Engineers (SREs) who are responsible for improving the product's reliability and performance.

Ideally, the goal of any world-scale distributed system (product or service) is to operate in harmony from day one. To achieve this, developers and the operations team must team up to create a reliable system. This will help developers build solutions in a faster and more transparent way so that SREs can manage applications effectively.

Here's What Developers Can Do To Help SREs

Developers and SREs are two sides of the same coin within a tech company. Developers work towards delivering successful software, and SREs ensure the software's uptime and overall health.

The development of software is a continuous process where its health and performance characteristics must be monitored after delivery. SRE practices ensure product reliability. Site reliability engineers are in charge of making sure the software functions as expected.

SREs and software engineers must work with a wide spectrum of information like response time and MTTR from virtualized deep layers of cloud platforms. In short, developers can help SREs by making the source code easy to understand, access, and modify to optimize the system’s performance.

Following are five ways developers can help SREs

1. Scaling The Platform With The Concept Of A 12-factor App Method

A 12-factor app follows a methodology for building modern web applications. By default, it is meant to be stateless and immutable, which means it can be deployed in any cloud environment, such as Heroku, where we don't entirely control the infrastructure.

The twelve factors of this scalable approach to building applications are codebase, dependencies, config, backing services, build/release/run, processes, port binding, concurrency, disposability, dev/prod parity, logs, and admin processes. They are suited to polyglot programming.

The goal of the project is runtime independence. In other words, you will be able to run applications in any environment, without facing any difficulty operating in the cloud. It determines an app's packaging, deployment, and run-time.

It is an effective way to establish a resilient architecture that minimizes failure points and runs on a local or cloud back-end. Applications built this way are safe to deploy, highly available, auto-scalable, horizontally scalable, stateless, location transparent, and dynamically configurable.

It is also used for structuring an application or system so that it is portable, scalable, and stable when deployed to any cloud provider. So, the workload of an SRE is reduced to a large extent.

2. Sharing Performance Testing Data Insights

As a software testing practice, performance testing focuses on assessing the software functions under various complex conditions.

SREs need to know the metrics of performance-tested applications in order to understand the thresholds. It enables them to understand what needs to be done to make the application work as intended.

For example, in the context of backend applications, developers use tools like Gatling to load test an application and measure how much load it can take. This data should be shared with the SRE team as well.

There are some slight overlaps between the 12-factor app method and the following approaches. However, each is effective at creating synergies between development and operations.

3. Significance of Documentation and Configuration files

The success of SRE teams depends on documentation. They should be provided with well-defined bodies of documentation associated with various SRE functions. They need to know which documentation is most relevant for troubleshooting an outage.

Next, config files allow you to change your application configuration without modifying the source code. They store application-specific information such as database connection strings (URLs), usernames, passwords, API addresses of dependent or auxiliary services, and application-specific parameters. They help you track and control various data related to your web applications.

Configuration variables in code could act like parameters that could change based on external factors, for example, the URL of another web service or database, or queue. Likewise, if we are configuring the “token” module, the config file will tell us what token types are available and how to use each one of them.

It should also tell us about the default values of that token, whether it has any dependencies on another token or not etc. Also, if there are any special cases defined for that particular token, they should be documented in the same configuration file. During incident response operations, SREs use configuration files to restore system infrastructure.
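To illustrate keeping configuration outside the source code, here is a minimal Python sketch that reads settings from environment variables, in the spirit of the 12-factor "config" factor. The variable names (`DATABASE_URL`, `QUEUE_URL`, `REQUEST_TIMEOUT_SECONDS`) and the default values are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AppConfig:
    database_url: str
    queue_url: str
    request_timeout_seconds: float


def load_config(env: Optional[Mapping[str, str]] = None) -> AppConfig:
    """Build the configuration from the environment (or a supplied mapping)."""
    source = os.environ if env is None else env
    return AppConfig(
        # Required: a missing value raises KeyError instead of silently using a default.
        database_url=source["DATABASE_URL"],
        # Optional values document their defaults in one place.
        queue_url=source.get("QUEUE_URL", "amqp://localhost:5672"),
        request_timeout_seconds=float(source.get("REQUEST_TIMEOUT_SECONDS", "5")),
    )


if __name__ == "__main__":
    print(load_config({"DATABASE_URL": "postgres://db.example/app"}))
```

Keeping the defaults and required keys in a single function gives an SRE one place to see which values can be changed without touching the code.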

4. AIOps Supported System Admin Functionalities

The site reliability engineer (SRE) needs to reboot and deploy servers constantly, even when there is no downtime. This will require quite a bit of effort when an update is deployed in production.

In this case, the SRE team should be notified of system changes via the configuration files or documentation accessible through the admin dashboard. This can also be done by developing custom Artificial Intelligence for IT Operations (AIOps) solutions.

This process helps SREs in maintaining and operating data centers using AI-powered methods and tools. For example, these AI-based tools can help in root cause analysis for remediation, automated anomaly detection, optimization, and the automatic initiation of self-stabilising activities.

5. Increasing Observability Of The System

Cloud-native systems are becoming increasingly complex, making observability paramount. Making your system easily observable means knowing what is causing problems with it or how systems interact with it. Observability maximizes visibility over the infrastructure.

Observability tools have a great deal of value in the world of DevOps and SRE. They provide data about logs, metrics, error rates, traces, and even network interfaces. Application performance monitoring (APM), in turn, is a means to track your application's code performance. These tools help you locate and resolve issues with the performance of your applications.

Developers can help SREs by enabling debug support here. This can be done by allowing the applications to expose relevant metrics like request count, details about successful/failed requests, etc., in the case of a web service. This way, observability helps the SRE determine how the application is performing in production and if it needs to be scaled up/out.
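A minimal sketch of exposing such metrics from a web service is shown below. It uses only the Python standard library; the `RequestMetrics` class, the metric names, and the choice to count status codes below 400 as successful are assumptions for illustration, and a production service would more likely use an established metrics library.

```python
from collections import Counter
from typing import Dict


class RequestMetrics:
    """In-memory request counters that a service could expose to monitoring."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, status_code: int) -> None:
        # Assumption: any status below 400 counts as a successful request.
        outcome = "success" if status_code < 400 else "failure"
        self._counts[outcome] += 1

    def snapshot(self) -> Dict[str, int]:
        ok = self._counts["success"]
        failed = self._counts["failure"]
        return {
            "requests_total": ok + failed,
            "requests_success": ok,
            "requests_failed": failed,
        }


if __name__ == "__main__":
    metrics = RequestMetrics()
    for status in (200, 201, 404, 500):
        metrics.record(status)
    print(metrics.snapshot())
```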

Final Thoughts

With these best practices, developers can make the SRE's life easier. Tell us how these five practices helped an SRE on your team organize their daily work and be more productive.

Top SRE Toolchain Used By Site Reliability Engineers

MerlynMShelley — Thu, 20 May 2021 12:36:05 +0000

We have compiled a list of the most popular and sought-after tools (some you may have heard of) that SREs need in their toolkit at every phase of a production system to keep up with SRE best practices.

Site reliability engineering (SRE) practices help organizations by ensuring smooth functioning of their deliverables with utmost reliability and resilience.

These can be achieved by a set of well-defined tools that are deployed at every phase of the production system to keep up with SRE best practices.

This blog identifies and lists the chain of top SRE tools and their significance towards ensuring reliability of the architecture.

How to Standardize SRE Practices with SRE Toolchain

Every organization has its own way of structuring its infrastructure, so the standardization of SRE tools depends on how the architecture is built. For example, a social networking architecture would focus on establishing high-level support facilities and easily scalable infrastructure, and would therefore rely on tools that center around cloud-native applications, DevOps, and CI/CD automation. An e-commerce platform, on the other hand, would rely on application, data storage, and DevOps tools for building and maintaining its architecture in accordance with SRE practices.

Thus, by comparing and considering the basic requirements of every architecture, we have arrived at an SRE tool stack that can potentially help standardize SRE best practices.

SRE Toolchain and Top Tools Used by SREs in Each Category

1. Containers for Microservices and Orchestration Tools

Microservices are an architectural style that splits a monolithic architecture into multiple individual logical functions or services. Containers play a vital role by gathering everything a microservice requires (code, libraries, dependencies, binaries, etc.) in one place so that it can run.

Tools	Key Features	Open Source (Y/N)	Pricing
Docker	A comprehensive end-to-end platform that accelerates portable application development across both cloud and desktop	Y	NA
Kubernetes	Generally referred to as K8s; used for automating deployment, scaling, and delivery lifecycle management of containerized applications	Y	NA
Swarm	Natively manages a cluster of Docker containers and deploys the application services	Y	NA
Apache Mesos	A distributed systems kernel that offers linear scalability and native support for Docker containers, and facilitates two-level scheduling by running cloud-native and legacy applications at the same time	Y	NA
Podman	A basic container engine used for the development, management, and running of OCI containers across Linux systems	Y	NA

2. Source Control Tools

Source code is a vital element of cloud infrastructure. It has to be tracked, managed, and updated whenever a change is made, and this can be done with source control tools. These tools help the development team embrace changes in codebases and ensure the source code is always up to date for the effective functioning of the systems and infrastructure.

Git is a widely used, free, and open-source distributed version control system. It is adopted by organizations of all sizes for updating their source code, which is often hosted on GitHub.

3. Continuous Integration / Continuous Deployment (CI/CD) Tools

Continuous integration is the practice of automatically testing every change that is made to the source code. Continuous deployment follows continuous integration by pushing the tested codebase to the production environment. Here are a few tools that can help in executing these functions:

Tools	Key Features	Open Source (Y/N)	Pricing
Jenkins	CI/CD Automation platform that supports automation across development, deployment, and testing of any project.	Y	NA
CircleCI	A CI/CD platform that helps in automating the application development process either across the platform’s cloud or organization’s own infrastructure	N	Free & other pricing options available
GitLab	It is an open core model of open-source DevOps platform that helps with collaboration, gaining visibility, and enhances development velocity	Y	Free & other pricing options available
GoCD	Free open-source CI/CD server that helps with easy modeling and visualization of complex workflows	Y	NA
Semaphore	A CI/CD platform that assures enormous productivity by avoiding bottleneck points across the engineering team. It also facilitates Enterprise level CI/CD pipeline as a service	Y	Free, Pay-as-you-go, Enterprise Cloud plans are Available

4. Data Storage tools

Data is a key ingredient of every digital business. It is also an important asset that eases the decision-making process. As SRE metrics are built on system performance data, this data has to be carefully stored in a well-suited and easy-to-access interface. Below is a set of tools that can greatly help in data storage and processing.

Tools	Key Features	Open Source (Y/N)	Pricing
MySQL	Fully managed database service that helps deploy cloud-native applications. It comes with a highly efficient analytics engine to accelerate the overall database services	Y	NA
PostgreSQL	Open-source object-relational database service that has powerful features to support the cloud applications’ performance factors	Y	NA
MongoDB	Document-oriented database service that supports JSON for modern cloud applications with features like horizontal scaling, automatic failover, and the ability to assign particular data to a location	Y	NA
Apache Hadoop	Open-source software library and framework that helps in processing large sets of distributed data across the network	Y	NA
Apache Hive	Data warehouse software that facilitates reading, writing, sharing, and managing huge sets of distributed data through SQL.	Y	NA

5. Configuration Management Tools

Configuration management is the process of tracking and controlling all the changes (configuration, identification, and implementation) that are made to a software product. These tools detect any unauthorized changes and control the implementation of changes across software solutions.

Tools	Key Features	Open Source (Y/N)	Pricing
Ansible	Simple configuration management and application deployment tool that helps in enabling infrastructure as code (IaC) architecture	Y	60-day trial, customized pricing available
Chef	Streamlines configuration management tasks across cloud platforms to automatically provision new machines	Y	Flexible pricing
Puppet	Model-driven software configuration management tool used to manage the entire lifecycle of IT infrastructure	Y	Customized pricing
Saltstack	Event-driven IT automation software used for infrastructure configuration, provisioning, and management	Y	Offers personalized pricing

6. Monitoring and Observability Tools

Monitoring and observability are two main functions in maintaining system health, and SREs work closely with these tools. A prime role of site reliability engineers is to develop custom queries for the alert managers inside the monitoring tools’ architecture. These queries check whether all system functionalities are working as expected and generate alerts when there is any deviation in system behavior.

Metrics Collection Tools

Tools	Key Features	Open Source (Y/N)	Pricing
Prometheus	An open-source monitoring tool that provides a dimensional (time-series) data model of all system performance characteristics	Y	NA
Google Cloud Operations (Stackdriver)	Helps in monitoring your infrastructure and troubleshoots applications by indicating errors with notifications	N	Pricing calculator
InfluxDB	Supports the development team to build and monitor time-stamped data series across the infrastructure	Y	Free version & customized pricing
Sensu Go	An observability tool that helps in establishing monitoring as code across all cloud architecture	Y	Free plan, custom pricing

Log Aggregation Tools

Tools	Key Features	Open Source (Y/N)	Pricing
Fluentd	Open-source data collector built exclusively for the unified logging layer across an architecture	Y	NA
Sentry	Collects all the system data from various endpoints and optimizes the performances of the source code	Y	Pricing structure
Logstash	Open server-side data processing pipeline that helps the development team to ingest various data sources into a single preferred stash	Y	Advanced features with pricing structure

Distributed Tracing Tools

Tools	Key Features	Open Source (Y/N)	Pricing
OpenTelemetry	Open-source observability framework for monitoring cloud-native software applications with telemetry data. OpenTracing and OpenCensus have merged to form a standardized OpenTelemetry tool	Y	NA
Jaeger	Open-source end-to-end distributed tracing platform that helps in monitoring and troubleshooting issues across a distributed network	Y	NA

Application Performance Monitoring Tools

Tools	Key Features	Open Source (Y/N)	Pricing
AppDynamics	Full-stack observability platform that provides real-time data insights for system performance and helps in driving business growth and productivity	N	Pricing structure
New Relic	Simple observability tool that helps development teams in instrumenting, analyzing, troubleshooting, and optimizing their complete tech stack	Y	Pricing structure
Dynatrace	This tool has got observability, security features, intelligent solutions, and automation features in a single platform that helps developers to monitor the performance of the system effectively	N	Custom Pricing options available

7. Dashboarding Tools

Dashboarding tools help SREs to scrutinize issues more efficiently by displaying all the necessary data (Key Performance Indicators and Critical data points) in one screen. These tools facilitate pictorial or graphical representation of system data, thereby giving precise information about the system's health.

Tools	Key Features	Open Source (Y/N)	Pricing
Grafana	Provides an integrated solution to metrics and logs for composing observability characteristics in the form of graphical representation	Y	Free forever & Customized pricing
Stashboard	Status based dashboard solution for APIs and service-based software solutions	Y	NA
Redash	Helps to connect and create queries on data sources to visualize all the data in the form of a dashboard for easy collaboration across various teams	Y	30 day trial & other pricing options available
Metabase	An open-source tool for self-hosted platforms that enables them to connect data points for visualization purposes. Whereas, Metabase Cloud platform has exclusive advanced features like single sign-on and embedded analytics	Y	Free Open Source Version, Advanced features available with pricing

8. Incident Management / On-call Alerting System Tools

An incident management tool is an essential part of managing system architecture. These tools sit on top of all the monitoring/error tracking/logging applications and direct all the incoming system alerts to specific internal services to initiate the recovery processes.

Tools	Key Features	Open Source (Y/N)	Pricing
PagerDuty	An incident management tool with a real-time operations platform that ensures fewer outages	N	14-day free trial with pricing
Opsgenie	A modern incident management platform that ensures always-on digital services	N	14-day free trial with pricing
Squadcast	Cloud-based incident management platform built around Site reliability engineering (SRE) best practices that helps to improve incident resolution metrics and ultimately, the reliability of systems	N	Freemium version, advanced features with flexible pricing options

Conclusion

When building your SRE toolchain, there’s no “one-size-fits-all” set of tools. The tools SREs use at any given time will depend on where an organization is in its SRE journey. Organizations at the initial stages of their SRE journey will tend to use more specialised operations tools than more mature organizations. That said, SRE teams will experiment and adopt the right tools as they continue on their journey to seek new, efficient ways to bring more reliability to everything they do.

Regardless of the kind of platform you are running, we are sure that the tools listed here will be useful to you. On similar lines, for a more detailed look at the top observability tools used by DevOps/SREs, head over to this blog.

The Key Differences between SLI, SLO, and SLA in SRE

MerlynMShelley — Tue, 16 Feb 2021 11:39:27 +0000

To incentivize reliability in your platform, there should be shared goals across your team to measure & quantify the capabilities of your product/service along with customer experience. Define the path of "Always-On" services by understanding a few key SRE fundamentals and their implications - SLIs, SLOs & SLA.

Framing SRE metrics for building or scaling a product is quite a daunting task.

In an SRE journey, embracing risks and managing them with proper service-level metrics is known to be the best way to achieve reliability.

In this blog, we explore the key differences between these basic site reliability metrics and their implications for building a sustainable and reliable product. Here’s the outline for quick reference:

What are the key differences between SLIs, SLOs, and SLA
- Service Level Indicators (SLIs)
- Service Level Objectives (SLOs)
- Service Level Agreements (SLA)
How these SRE metrics can help in drafting system performance and reliability
How to improve Customer experiences with right Target Values (SLOs)
- Choosing the Target Values
- Defining the Target Values (SLOs) in Practice
- Error Budgets
- Setting SLOs according to Customer Expectations
Key Takeaways
Conclusion

What are the differences between SLIs, SLOs, and SLA

SRE practices are becoming more prevalent and much sought after. And the prerequisite of success in SRE is availability.

These acronyms - SLIs, SLOs, and SLA are the primary metrics of Site Reliability Engineering (SRE).

Service Level Indicators (SLIs)

SLI as defined in Google’s SRE Handbook is, “a carefully defined quantitative measure of some aspect of the level of service that is provided.”

SLIs are measurements of the characteristics of a service. SLIs directly gauge those behaviours that have the greatest impact on the customer experience. The most common SLIs, known as the Four Golden Signals, are:

Latency
Traffic
Error rate
Saturation

Other variations are USE (Utilization, Saturation, and Errors) and RED (Rate, Errors, and Duration).

The formula used to calculate SLI is,

SLI = Good Events * 100 / Valid Events

If the value of SLI is 100, the performance of the system is ideal and if it drops to 0, the system is broken.
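As a small worked example of the formula above, the Python sketch below computes an SLI as a percentage. The function name and the sample numbers are illustrative, not taken from any specific tool.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI = good events * 100 / valid events, expressed as a percentage."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events * 100 / valid_events


if __name__ == "__main__":
    # Illustrative numbers: 999 good requests out of 1000 valid requests.
    print(sli_percent(999, 1000))
```

With 999 good requests out of 1000 valid ones, the SLI is 99.9; with no good requests it is 0, which matches the two extremes described above.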

An SLI is product (service) centric, which means it always revolves around measuring the capabilities or characteristics of a product or service.

Service Level Objectives (SLOs)

SLOs are key threshold values for each SLI that quantify the availability and quality of service. They are an objective measure of your product’s reliability, or performance goals.

SLOs as explained in Google’s SRE workbook, “Service level objectives (SLOs) specify a target level for the reliability of your service. Because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices.”

These are numerical reliability or performance targets that a developer or an SRE should maintain while building and scaling a product. Any changes in the product or service must fall under these defined target values.

Additional Reading: You can check out our detailed case study on how to implement small changes in your SLOs by adopting the SMART (Specific, Measurable, Achievable, Relevant, and Time-bound) strategy in crafting the right SLOs for your business.

SLOs should be customer-centric: they should relate directly to the customer experience. The core purpose of SLOs is to quantify how reliable the product and services are for the customer.

SLOs can also be used to drive other improvements. For example, you could set an SLO for backup duration if you wanted to maintain or improve it.

Service Level Agreements (SLA)

SLA is an agreement between the service provider and customer about service deliverables.

With an SLA, the consumer would have a clear idea about the proposed product or service in terms of functionality, reliability, and performance.

Google’s CRE Life lessons define SLA as, “An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid.”

An SLA is a vendor-user agreement and a customer-centric metric that defines the committed functionality, performance, and reliability of a product or service, as well as the penalty for non-compliance. It also helps in establishing transparency and trust between the company and its customers. If the company breaches the terms agreed in the SLA, it is liable to compensate its customers for the loss incurred.

How these SRE metrics help in drafting system performance and reliability

If you are a business that sells any product or service to your customers and assures them about your product capabilities, then you should draft your SLI, SLO and SLA now!

These service-level metrics would help you gain customer trust and improve your system reliability and performance.

As a service provider/vendor, you should start by coming up with key performance indicators that measure your product's performance; these form your SLIs. Remember that they are a direct measure of your system's behaviour at every stage of your business.

Secondly, you have to set availability targets for these indicators; these form your SLOs. This is a completely data-driven phase in which you gather data from customer queries and stakeholders' expectations, extract the insights, and finalize the target/threshold values needed to achieve better reliability.

As the final step, you should create your SLA. Here you list the reliability values you commit to and help your customers understand your product's capabilities.

Thus, SLIs are the foundational blocks that help in building SLOs, which in turn support the overall reliability promised in your SLA.

How to improve Customer experiences with right Target Values (SLOs)

Customer experience plays an important part in deciding key SRE metrics. SLOs are the focus points in deciding the assured system reliability the company would offer to the end-users.

Choosing the Target Values

Choosing the appropriate SLOs or target values is in itself a complex task!

Here are a few key practices that can help you in deciding the right SLOs.

The target value of a service level is always measured only by an SLI.

There is a close dependency between SLIs and SLOs: the SLI is the measured value and the SLO is the limit it is held to. This relationship acts as the control while measuring and monitoring the entire system architecture. So, according to Google,

"A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound."

Lower bound SLOs ≤ SLI ≤ Upper bound SLOs
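The structure quoted above amounts to a simple range check. The sketch below is one way to express it in Python; the function name `meets_slo` and the sample values are hypothetical and chosen only to show both the one-sided and the two-sided form.

```python
from typing import Optional


def meets_slo(sli: float,
              lower: Optional[float] = None,
              upper: Optional[float] = None) -> bool:
    """Check lower <= sli <= upper, where either bound may be omitted."""
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


# Availability SLI (percent) held to a lower bound only.
print(meets_slo(99.95, lower=99.9))   # True

# Latency SLI (ms) held to an upper bound only, i.e. "SLI <= target".
print(meets_slo(120.0, upper=100.0))  # False
```

Availability-style SLIs usually need only a lower bound, while latency-style SLIs usually need only an upper bound, which is why both bounds are optional here.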

What should you do while choosing the right SLOs?

Never choose SLO targets based only on the current performance of your systems; look at your historical performance instead
Keep it simple - don't specify absolute target values as SLOs
Don't aim for over-achievement or perfection; reliability cannot be 100%
Always keep a safety margin in your SLOs, for example by basing them on the historical average of your availability
Choose only as many SLOs as you need to cover the key attributes of the system, which means having only a few SLOs

While drafting the SLA, always remember:

Reliability values in SLA < Historical Average of your availability SLOs

Defining the Target Values (SLOs) in Practice

Google emphasises the importance of defining the objectives in practice:

“SLOs should specify how they’re measured and the conditions under which they’re valid”

SLOs can never be 100%. But we can specify the time limit within which a given share of requests is assured to complete. For example, you can specify the SLO targets along the performance curve as,

99.9% of requests would complete in less than 100 ms
99% of requests would complete in less than 10 ms
90% of requests would complete in less than 1 ms
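The three targets above can be checked against measured latencies with a short script. This is only a sketch: the names `LATENCY_TARGETS`, `percent_faster_than` and `evaluate_latency_slos` are made up for illustration, and the sample latencies are invented rather than real measurements.

```python
from typing import Dict, Sequence, Tuple

# (target percentage of requests, latency threshold in ms), from the list above.
LATENCY_TARGETS: Tuple[Tuple[float, float], ...] = (
    (90.0, 1.0),
    (99.0, 10.0),
    (99.9, 100.0),
)


def percent_faster_than(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Percentage of samples that completed in less than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast * 100 / len(latencies_ms)


def evaluate_latency_slos(
    latencies_ms: Sequence[float],
    targets: Sequence[Tuple[float, float]] = LATENCY_TARGETS,
) -> Dict[float, bool]:
    """Map each latency threshold (ms) to whether its target is met."""
    return {
        threshold_ms: percent_faster_than(latencies_ms, threshold_ms) >= target_percent
        for target_percent, threshold_ms in targets
    }


# Invented sample: 950 requests at 0.5 ms, 45 at 5 ms, 5 at 50 ms.
samples = [0.5] * 950 + [5.0] * 45 + [50.0] * 5
print(evaluate_latency_slos(samples))  # {1.0: True, 10.0: True, 100.0: True}
```

Stating several points on the curve rather than a single average keeps a slow tail from hiding behind a good mean.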

This is where error budgets in SRE come in handy: an error budget is the rate at which SLOs can be missed. It provides a clear, objective metric that determines how unreliable a service is allowed to be over a specific period. It also helps to establish a balance between reliability and innovation. According to Google's SRE book, "An error budget is just an SLO for meeting other SLOs!"

Error Budgets

Google's Motivation for Error Budgets, defines Error Budget as,

“the tool SRE uses to balance service reliability with the pace of innovation”

“the amount of error that your service can accumulate over a certain period before your users start being unhappy.”

An SLI is expressed as a percentage, and the objectives derived from SLIs are the SLOs. The error budget is what remains when the SLO is subtracted from 100%.

The formula for Error budget is,

Error budget = 100 - Internal Availability SLOs

For example, if the internal availability SLO is 99.95%, then the corresponding error budget is 100 - 99.95 = 0.05%. That is, your service can fail up to 0.05% of the time and still meet its objective.
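To make the budget more tangible, it can be converted into allowed downtime. The sketch below assumes a 30-day measurement window and treats availability as time-based; both are illustrative assumptions, not something the formula itself prescribes, and the function names are hypothetical.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget = 100 - internal availability SLO (both in percent)."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return 100 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime the budget allows over the window, assuming time-based availability."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100


print(f"{error_budget_percent(99.95):.2f}%")            # prints 0.05%
print(f"{allowed_downtime_minutes(99.95):.1f} minutes")  # prints 21.6 minutes
```

Under these assumptions, a 99.95% SLO leaves about 21.6 minutes of downtime in a 30-day window.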

According to Google's SRE blog, you have to measure every service you offer with an availability SLO, without which you cannot make informed decisions about improving reliability. The more reliable you promise your services to be, the more expensive they are to operate. So, by quantifying your services with availability SLOs, you can either allow greater momentum for product development (at the cost of some reliability) or make your systems more reliable (at the cost of slower product development).

And to improve your services you can build "deemed SLIs" or approximate SLIs to measure customer reliability of your platform at a very granular level. This contributes to measuring low-level outages and drives the operational response with which you can refine your customer expectations. This, in turn, helps you in scaling your product for better customer experience.

Setting SLOs according to Customer Expectations

Set an SLO buffer, which helps accommodate maintenance windows and improve the performance of the system without disappointing users
Restrict over-dependence between services, where one slow service drags down the others and makes them take longer to load
While drafting an SLA, business and legal teams are required to pick appropriate consequences and penalties in the event the agreement is breached. An SRE on the team helps them understand the likelihood and difficulty of meeting the SLOs contained in the SLA.
Be smart and conservative when you advertise your services' SLOs, because it is hard to change or withdraw SLA commitments that turn out to be unachievable

Key Takeaways

You should prioritize setting up your availability SLOs over your SLA.
Make sure the reliability value you state in the SLA is slightly lower than the historical average of your availability SLOs. This safeguards against the average being high only because a failure has not occurred yet.
If your MTBF (mean time between failures) is 18 months and your service is only 6 months old, then the measured availability will be artificially high.
Also, if you commit to a reliability value in the SLA that is greater than or equal to your availability SLO, your team loses the buffer between your goal and the penalty level.
Your accumulated errors over a given period should fall within the calculated error budget. If they do not, you risk breaching the SLA, which would mean financial loss.
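The last takeaway can be tracked with a small calculation that compares the errors accumulated so far with the number the budget allows. This is a rough sketch for a request-based SLO; the function names and the request counts are hypothetical.

```python
def budget_consumed_fraction(slo_percent: float, valid_events: int, bad_events: int) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    allowed_bad = valid_events * (100 - slo_percent) / 100
    if allowed_bad <= 0:
        raise ValueError("no error budget available for these inputs")
    return bad_events / allowed_bad


def within_error_budget(slo_percent: float, valid_events: int, bad_events: int) -> bool:
    """True while accumulated errors still fit inside the budget."""
    return budget_consumed_fraction(slo_percent, valid_events, bad_events) <= 1.0


# Hypothetical month: 1,000,000 valid requests against a 99.9% SLO,
# which allows roughly 1,000 failed requests.
print(within_error_budget(99.9, 1_000_000, 400))    # True
print(within_error_budget(99.9, 1_000_000, 1_500))  # False
```

Reporting the fraction consumed, rather than only a pass/fail flag, gives the team an early signal to slow down releases before the budget runs out.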

Conclusion

Delivering product value depends heavily on the performance and reliability of your services. Service level metrics act as a key tool to measure and quantify the capabilities of your product/service.

And, yes, it is necessary to define the path of how you are going to deliver the commitment towards "always-on" services. Appropriate SLOs and SLIs will help you define that path.

We hope that this article has helped you understand SLIs, SLOs, and SLA in a better way so that you can use them in improving your customer experience and overall product and service capabilities.