DEV Community

Patrick Londa for Steadybit

Originally published at linkedin.com

The Business Case for Chaos Engineering: An ROI Calculator for Testing Application Reliability

"What do we get from intentionally injecting failures into our systems?"

Chaos engineering is one of the best ways to proactively test your application reliability, but many leadership teams have never heard of the concept.

Engineering teams need to be able to frame a strong business case to explain the value of chaos engineering and reliability testing to budget holders. When there’s a major outage, the value of application reliability becomes immediately clear, but a strong ROI plan can help earn and maintain executive support while your systems are steady.

We have just released an interactive ROI calculator that can help SRE teams frame the business value of proactive reliability efforts like chaos engineering.

Ensuring Application Resilience with Chaos Engineering

Complex software systems are destined to break and fail at some point, especially with all the factors present in modern production environments. When we ask teams about their approach to system reliability, we often hear back: “We have enough chaos already!”

When done correctly, chaos engineering isn’t adding chaos to your systems. Rather, it’s running different scenarios to validate how resilient your systems are under stressful conditions. You can test your expectations against the reality of performance degradation during an availability zone outage, a delayed dependency, or a sudden surge of users.

These experiments serve as a feedback loop earlier in the software development cycle that enables teams to design more fault-tolerant systems for their users.

When reliability testing is rolled out across applications, teams can find and address risks that could lead to critical incidents and improve their mean time to resolution (MTTR).

Early ROI Prototypes and Why They Didn’t Work

📉 Savings from Reducing the Overall Number of Incidents

Initially, we explored the premise that implementing chaos engineering would lead to fewer incidents overall at every severity tier. If a user provides the number of incidents they have had in the past year, we could assume some percentage reduction across all of them. This makes the calculation pretty simple once a cost amount is assigned to each tier of incident.

The trouble with this approach is that it assumes all incidents should be avoided. Incident counts are useful for flagging anomalous behavior, and some incidents don’t result in negative impacts for customers at all. An increase in low-level incidents could actually be a positive sign that alert coverage is improving and properly surfacing system weaknesses.

Moving forward, we chose to focus on savings from reducing the number of critical incidents (Sev0, P1, etc.) year over year. This metric is easier to track, and it doesn’t create an incentive to avoid logging incidents at lower severity levels.

🧪 Identifying & Fixing Reliability Risks By Running More Experiments

If you go from 0 to 100 experiment runs on your systems, you are bound to discover new performance gaps and reliability risks. How many more risks will you discover at 200 experiment runs?

We built a version of an ROI calculator that assumed that as the number of experiments increased, a certain percentage of experiments would reveal issues at different incident risk levels with assigned potential costs. Teams would then fix a certain percentage of these reliability risks depending on their development capacity. As the experiment run count scaled, there would be diminished returns for revealing new issues.
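The diminishing-returns assumption above can be sketched with a simple saturation model. This is purely illustrative: the function name, the pool size, and the per-run discovery rate are assumptions for the sketch, not parameters from the actual calculator.

```python
import math

def expected_new_risks(runs: int, pool_size: float = 50.0,
                       discovery_rate: float = 0.01) -> float:
    """Expected number of distinct reliability risks surfaced after `runs`
    experiments, assuming a fixed pool of latent issues and an equal per-run
    chance of hitting each one. New findings taper off as the pool drains."""
    return pool_size * (1 - math.exp(-discovery_rate * runs))

# The second 100 runs reveal fewer *new* issues than the first 100 did.
first_100 = expected_new_risks(100)
next_100 = expected_new_risks(200) - first_100
```

Under this model, going from 100 to 200 runs still pays off, just less per run, which is exactly the dynamic that made a single detection-rate curve feel too one-dimensional.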

While it’s true that teams will find more potential reliability issues as they run more experiments, this approach was a little too one-dimensional. We didn’t have well-documented references for how issue detection rates would change, and teams would likely need to create a new reporting mechanism to follow along as they mitigated risks.

There is also nuance here: some teams automate their experiment runs and use them as regression tests in their CI/CD pipelines. We decided it would be better to measure impact with metrics that SREs at most organizations already track.

Where We Landed with Key Metrics for Our ROI Calculator

⚡ Savings from Faster Average MTTR

As we were iterating on the inputs and outputs, we saw a great presentation from Keith Blizard and Joe Cho at AWS re:Invent 2024, featuring a case study on the progress Fidelity Investments had made in rolling out chaos engineering across their organization. They documented major improvements to mean time to resolution (MTTR) as they scaled chaos testing coverage across applications.

We used these case study metrics to plot the correlation between the percentage of applications with chaos testing coverage and the incremental improvement to MTTR. We then used this relationship to calculate improvements against an assumed industry-wide average MTTR of 175 minutes, according to this 2024 PagerDuty report.

This MTTR savings means fewer minutes of downtime, which that same study estimated can cost between $4,000 and $15,000 per minute. In our calculator, we ask users to input their “Annual Company Revenue” so we can use the most relevant cost of downtime per minute, as downtime is typically more costly for larger enterprises. This 2024 report commissioned by BigPanda found that downtime cost an average of $14,056 per minute for organizations with more than 1,000 employees.
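A minimal sketch of this savings calculation, using the 175-minute baseline MTTR and the $14,056/minute downtime figure cited above. The linear coverage-to-improvement relationship, the 50% improvement ceiling, and the incident count are assumptions for illustration; the actual calculator fits the coverage curve from case-study data.

```python
def mttr_savings_per_year(
    coverage: float,                 # fraction of apps with chaos testing (0-1)
    baseline_mttr_min: float = 175,  # industry average MTTR (2024 PagerDuty report)
    max_mttr_improvement: float = 0.5,  # assumed improvement at full coverage
    critical_incidents_per_year: int = 12,   # assumed for this sketch
    downtime_cost_per_min: float = 14_056,   # BigPanda 2024, 1,000+ employee orgs
) -> float:
    """Annual savings from faster resolution: minutes shaved off each critical
    incident, valued at the per-minute cost of downtime. Assumes improvement
    scales linearly with testing coverage, which is a simplification."""
    minutes_saved = baseline_mttr_min * max_mttr_improvement * coverage
    return minutes_saved * critical_incidents_per_year * downtime_cost_per_min
```

For example, 50% coverage under these assumptions shaves about 44 minutes off each critical incident.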

🛡️ Savings from Reducing Critical Incidents

At Steadybit, we partner with a wide range of customers and have seen how many major reliability gaps are uncovered by running chaos experiments. Using insights from our customers and referencing industry studies, we've seen that actively running reliability tests on any given application conservatively leads to an average 30% reduction in critical incidents for that application per year.

For our calculator, we ask users to input the total number of applications their organization operates and how many of these applications have reliability testing coverage. We multiply the standard 30% reduction per year in critical incidents by the percent of applications with testing coverage to get the overall incident reduction for the organization.
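The incident-reduction math above can be written out directly. The 30% figure comes from the article; the function name and the sample numbers in the comment are illustrative, not the calculator's actual internals.

```python
def critical_incidents_avoided(
    total_apps: int,
    apps_with_testing: int,
    critical_incidents_per_year: int,
    reduction_per_tested_app: float = 0.30,  # ~30% per-app reduction cited above
) -> float:
    """Critical incidents avoided per year: the per-application reduction
    scaled by the fraction of the portfolio that has testing coverage."""
    coverage = apps_with_testing / total_apps
    return critical_incidents_per_year * reduction_per_tested_app * coverage

# e.g. 200 apps, 50 covered, 40 critical incidents/year -> ~3 incidents avoided
```

Each avoided incident can then be valued at an assumed cost per critical incident to turn this count into a dollar figure.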

🛠️ Costs of Implementing Chaos Engineering

If you want to run chaos experiments at scale, you will likely need to onboard a commercial reliability platform or chaos engineering tool. Open source solutions can be a good starting place, but deploying these across teams and technologies can become increasingly time-intensive. We used general license estimates based on market knowledge and projected experiment activity.

Like with any new program, an organization will need engineers owning the project and dedicating time to a successful rollout of chaos testing. We included a field in our calculator for “Testing Rollout Managers,” measured in FTEs (40 hours/week of staff time). We used an average SRE salary of $160k per year as a benchmark to estimate the cost of this implementation effort.
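The cost side of the calculator reduces to staff time plus licensing. A sketch, using the $160k SRE salary benchmark from the article; the license figure passed in is a placeholder, since actual pricing depends on projected experiment activity.

```python
def implementation_cost_per_year(
    rollout_manager_ftes: float,          # "Testing Rollout Managers" input
    platform_license_per_year: float,     # placeholder; varies by vendor and scale
    sre_salary_per_year: float = 160_000, # benchmark salary from the calculator
) -> float:
    """Annual program cost: rollout-manager staff time plus the
    commercial reliability platform license."""
    return rollout_manager_ftes * sre_salary_per_year + platform_license_per_year

# e.g. 1.5 FTEs plus a $50k license -> $290k/year
```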

Showing the Return On Investment for Reliability Testing

We ask users to project how they would expect to roll out chaos engineering at their organization, including unique test types, number of experiments, and coverage across applications. Our ROI calculator will then output a summary and detailed view of your projected savings, implementation costs, and return on investment. When you game out multi-year adoption goals, you'll be building a business case that can help you frame the value of making this type of investment.
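Tying the pieces together, the headline number is a standard ROI ratio: net savings relative to cost. The dollar amounts in the example are hypothetical inputs, not outputs of the real calculator.

```python
def roi(annual_savings: float, annual_cost: float) -> float:
    """Return on investment: net savings as a multiple of cost.
    1.0 means the program paid for itself twice over its cost basis... no:
    a value of 1.0 means net savings equal the cost (a 100% return)."""
    return (annual_savings - annual_cost) / annual_cost

# e.g. $870k projected savings against $290k of cost -> 2.0, i.e. a 200% return
```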

If you’re successful in getting buy-in to roll out chaos engineering, you’ll need to report back on your progress. If you’re using an incident management platform like Splunk or PagerDuty, you may already have built-in MTTR metrics available to reference. You can also track the number of critical incidents using observability tools like Datadog, Dynatrace, or Grafana Labs.

These metrics will hopefully show clear improvements, but your systems may become increasingly complex at the same time that you’re rolling out this testing, especially with the rise of AI agents. Even simply maintaining your current reliability posture as your systems evolve and become significantly more complex could be framed as a win.

Rolling Out Chaos Engineering Across Your Organization

Highly available applications don’t naturally draw attention in the way that outages do. If you want to continue the momentum and foster a culture of reliability, you'll need to intentionally share your wins. For example, if you find a major reliability vulnerability and are able to address it before it impacts customers, that's something to celebrate internally.

If you want help getting started with chaos testing and adopting a proactive reliability program, our team of experts at Steadybit is ready to help.

You can explore our reliability platform with a 30-day free trial or book a quick call with us to discuss how you can implement chaos engineering and start saving money today.
