<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Steadybit</title>
    <description>The latest articles on DEV Community by Steadybit (@steadybit).</description>
    <link>https://dev.to/steadybit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F10734%2F31c8db6f-9197-46e5-a615-5cd60e7026fa.png</url>
      <title>DEV Community: Steadybit</title>
      <link>https://dev.to/steadybit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/steadybit"/>
    <language>en</language>
    <item>
      <title>The Business Case for Chaos Engineering: An ROI Calculator for Testing Application Reliability</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Tue, 10 Mar 2026 21:41:50 +0000</pubDate>
      <link>https://dev.to/steadybit/the-business-case-for-chaos-engineering-an-roi-calculator-for-testing-application-reliability-2dhk</link>
      <guid>https://dev.to/steadybit/the-business-case-for-chaos-engineering-an-roi-calculator-for-testing-application-reliability-2dhk</guid>
      <description>&lt;p&gt;&lt;strong&gt;"What do we get from intentionally injecting failures into our systems?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;Chaos engineering&lt;/a&gt; is one of the best ways to proactively test your application reliability, but many leadership teams have never heard of the concept.&lt;/p&gt;

&lt;p&gt;Engineering teams need to be able to frame a strong business case to explain the value of chaos engineering and reliability testing to budget holders. When there’s a major outage, the value of application reliability becomes immediately clear, but a strong ROI plan can help earn and maintain executive support while your systems are steady.&lt;/p&gt;

&lt;p&gt;We have just released an &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;interactive ROI calculator&lt;/a&gt; that can help SRE teams frame the business value of proactive reliability efforts like chaos engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ensuring Application Resilience with Chaos Engineering
&lt;/h2&gt;

&lt;p&gt;Complex software systems are destined to break and fail at some point, especially with all the factors present in modern production environments. When we ask teams about their approach to system reliability, we often hear back: “We have enough chaos already!”&lt;/p&gt;

&lt;p&gt;When done correctly, &lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;chaos engineering&lt;/a&gt; isn’t adding chaos to your systems. Rather, it’s running different scenarios to validate how resilient your systems are under stressful conditions. You can test your expectations vs. the reality of performance degradation during an availability zone outage, delayed dependency, or a sudden surge of users.&lt;/p&gt;

&lt;p&gt;These experiments create a feedback loop earlier in the software development cycle that enables teams to design more fault-tolerant systems for their users.&lt;/p&gt;

&lt;p&gt;When reliability testing is rolled out across applications, teams can find and address risks that could lead to critical incidents and improve their mean time to remediation (MTTR).&lt;/p&gt;

&lt;h2&gt;
  
  
  Early ROI Prototypes and Why They Didn’t Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📉 Savings from Reducing the Overall Number of Incidents
&lt;/h3&gt;

&lt;p&gt;Initially, we explored the premise that implementing chaos engineering leads to fewer incidents overall at every severity tier. If a user provides the number of incidents they had in the past year, we could assume some percentage reduction across all of them. This makes the calculation simple once a cost is assigned to each tier of incident.&lt;/p&gt;
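&lt;p&gt;As a sketch, the math behind that first prototype looked something like this (the incident counts, costs, and reduction rate below are hypothetical placeholders, not calculator defaults):&lt;/p&gt;

```python
# Early prototype: assume a flat reduction across every severity tier.
# All figures below are hypothetical placeholders.
incidents_last_year = {"sev0": 4, "sev1": 12, "sev2": 40}
cost_per_incident = {"sev0": 250_000, "sev1": 50_000, "sev2": 5_000}
assumed_reduction = 0.20  # assume 20% fewer incidents at every tier

savings = sum(
    count * assumed_reduction * cost_per_incident[tier]
    for tier, count in incidents_last_year.items()
)
print(f"Projected annual savings: ${savings:,.0f}")
```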

&lt;p&gt;The trouble with this approach is that it assumes all incidents should be avoided. Incidents as a metric can be useful for flagging anomalous behavior, and some incidents don’t necessarily result in negative impacts for customers. An increase in low-level incidents could actually be a positive sign that alert coverage is improving and properly surfacing system weaknesses.&lt;/p&gt;

&lt;p&gt;Moving forward, we chose to focus on savings from reducing the number of critical incidents (Sev0, P1, etc.) year over year. This is a metric that is easier to track without creating incentives to avoid marking incidents at any level. &lt;/p&gt;

&lt;h3&gt;
  
  
  🧪 Identifying &amp;amp; Fixing Reliability Risks By Running More Experiments
&lt;/h3&gt;

&lt;p&gt;If you go from 0 to 100 experiment runs on your systems, you are bound to discover new performance gaps and reliability risks. How many more risks will you discover at 200 experiment runs?&lt;/p&gt;

&lt;p&gt;We built a version of an ROI calculator that assumed that as the number of experiments increased, a certain percentage of experiments would reveal issues at different incident risk levels with assigned potential costs. Teams would then fix a certain percentage of these reliability risks depending on their development capacity. As the experiment run count scaled, there would be diminished returns for revealing new issues.&lt;/p&gt;

&lt;p&gt;While it’s true that teams will find more potential reliability issues as they run more experiments, this approach was a little too one-dimensional. We didn’t have well-documented references for how issue detection rates would change, and teams would likely need to create a new reporting mechanism to follow along as they mitigated risks.&lt;/p&gt;

&lt;p&gt;There is also nuance here: some teams will automate their experiment runs and use them as regression tests in their CI/CD pipelines. We decided it would be better to measure impact with metrics that are already tracked and available to SREs at most organizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where We Landed with Key Metrics for Our ROI Calculator
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ⚡ Savings from Faster Average MTTR
&lt;/h3&gt;

&lt;p&gt;As we were iterating on the inputs and outputs, we saw a &lt;a href="https://youtu.be/mYKNR0UXwMc?si=d-HUlm7xivWNaTHd&amp;amp;t=2444" rel="noopener noreferrer"&gt;great presentation&lt;/a&gt; from Keith Blizard and Joe Cho at AWS re:Invent 2024, featuring a case study on the progress Fidelity Investments had made in rolling out chaos engineering across their organization. They documented major improvements to mean time to resolution (MTTR) as they scaled chaos testing coverage across applications.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/mYKNR0UXwMc"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;We used these case study metrics to plot the correlation between the percentage of applications with chaos testing coverage and the incremental improvement to their MTTR. We then used this relationship to calculate improvements against an assumed industry-wide average MTTR of 175 minutes, according to this &lt;a href="https://www.pagerduty.com/resources/insights/learn/cost-of-downtime/" rel="noopener noreferrer"&gt;2024 PagerDuty report&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This MTTR improvement means fewer minutes of downtime, which that same study estimated can cost between $4,000 and $15,000 per minute. In our calculator, we ask users to input their “Annual Company Revenue” so we can use the most relevant cost of downtime per minute, as downtime is typically more costly for larger enterprises. This &lt;a href="https://www.bigpanda.io/wp-content/uploads/2024/04/EMA-BigPanda-final-Outage-eBook.pdf" rel="noopener noreferrer"&gt;2024 report&lt;/a&gt; commissioned by BigPanda found that downtime cost an average of $14,056 per minute for organizations with more than 1,000 employees.&lt;/p&gt;
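&lt;p&gt;A simplified sketch of how those figures combine (the real calculator fits the coverage-to-MTTR curve from the case study data; the linear stand-in curve and the incident count below are hypothetical):&lt;/p&gt;

```python
# Simplified MTTR savings math; the coverage curve and incident count are stand-ins.
baseline_mttr_min = 175            # industry average MTTR (2024 PagerDuty report)
cost_per_minute = 14_056           # avg. downtime cost, 1,000+ employee orgs (BigPanda/EMA)
coverage = 0.40                    # share of applications with chaos testing
mttr_improvement = 0.5 * coverage  # stand-in: linear up to a 50% MTTR cut at full coverage

minutes_saved = baseline_mttr_min * mttr_improvement
critical_incidents_per_year = 10   # hypothetical input
annual_savings = minutes_saved * critical_incidents_per_year * cost_per_minute
print(f"Projected annual downtime savings: ${annual_savings:,.0f}")
```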

&lt;h3&gt;
  
  
  🛡️ Savings from Reducing Critical Incidents
&lt;/h3&gt;

&lt;p&gt;At &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt;, we partner with a wide range of customers and have seen how many major reliability gaps are uncovered by running chaos experiments. Using insights from our customers and referencing industry studies, we've seen that actively running reliability tests on any given application conservatively leads to an average 30% reduction in critical incidents for that application per year.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;our calculator&lt;/a&gt;, we ask users to input the total number of applications their organization operates and how many of these applications have reliability testing coverage. We multiply the standard 30% reduction per year in critical incidents by the percent of applications with testing coverage to get the overall incident reduction for the organization.&lt;/p&gt;
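&lt;p&gt;That calculation is a single coverage-weighted multiplication; roughly (the application counts here are hypothetical):&lt;/p&gt;

```python
# Coverage-weighted reduction in critical incidents (app counts are hypothetical).
total_apps = 50
apps_with_testing = 20
per_app_reduction = 0.30   # conservative 30% fewer critical incidents per tested app

coverage = apps_with_testing / total_apps
org_wide_reduction = per_app_reduction * coverage
print(f"Organization-wide critical incident reduction: {org_wide_reduction:.0%}")
```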

&lt;h3&gt;
  
  
  🛠️ Costs of Implementing Chaos Engineering
&lt;/h3&gt;

&lt;p&gt;If you want to run chaos experiments at scale, you will likely need to onboard a commercial reliability platform or chaos engineering tool. Open source solutions can be a good starting place, but deploying these across teams and technologies can become increasingly time-intensive. We used general license estimates based on market knowledge and projected experiment activity.&lt;/p&gt;

&lt;p&gt;As with any new program, an organization will need engineers who own the project and dedicate time to a successful rollout of chaos testing. We included a field in our calculator for “Testing Rollout Managers”, measured in FTEs (40hr/week of staff time). We used an average SRE salary of $160k per year as a benchmark to estimate the cost of this implementation effort.&lt;/p&gt;
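&lt;p&gt;Putting the cost side together is similarly direct (the FTE count and license figure below are hypothetical; only the $160k salary benchmark comes from the calculator):&lt;/p&gt;

```python
# Rough cost model; only the salary benchmark is from the calculator itself.
rollout_manager_ftes = 1.5         # "Testing Rollout Managers" input (hypothetical)
avg_sre_salary = 160_000           # salary benchmark used in the calculator
annual_license_estimate = 50_000   # hypothetical platform license figure

implementation_cost = rollout_manager_ftes * avg_sre_salary + annual_license_estimate
print(f"Estimated annual implementation cost: ${implementation_cost:,.0f}")
```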

&lt;h2&gt;
  
  
  Showing the Return On Investment for Reliability Testing
&lt;/h2&gt;

&lt;p&gt;We ask users to project how they would expect to roll out chaos engineering at their organization, including unique test types, number of experiments, and coverage across applications. Our &lt;a href="https://steadybit.com/chaos-engineering/roi-calculator/" rel="noopener noreferrer"&gt;ROI calculator&lt;/a&gt; will then output a summary and detailed view of your projected savings, implementation costs, and return on investment. When you game out multi-year adoption goals, you'll be building a business case that can help you frame the value of making this type of investment.&lt;/p&gt;

&lt;p&gt;If you’re successful in getting buy-in to roll out chaos engineering, you’ll need to report back on your progress. If you’re using an incident management platform like Splunk or PagerDuty, you may already have built-in MTTR metrics available to reference. You can also track the number of critical incidents using observability tools like Datadog, Dynatrace, or Grafana Labs.&lt;/p&gt;

&lt;p&gt;These metrics will hopefully show clear improvements, but your systems may become increasingly complex at the same time that you’re rolling out this testing, especially with the rise of AI agents. Even simply maintaining your current reliability posture as your systems evolve and become significantly more complex could be framed as a win.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rolling Out Chaos Engineering Across Your Organization
&lt;/h2&gt;

&lt;p&gt;Highly available applications don’t naturally draw attention in the way that outages do. If you want to continue the momentum and foster a culture of reliability, you'll need to intentionally share your wins. For example, if you find a major reliability vulnerability and are able to address it before it impacts customers, that's something to celebrate internally.&lt;/p&gt;

&lt;p&gt;If you want help getting started with chaos testing and adopting a proactive reliability program, our team of experts at &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; is ready to assist.&lt;/p&gt;

&lt;p&gt;You can explore our reliability platform with a &lt;a href="https://signup.steadybit.com/" rel="noopener noreferrer"&gt;30-day free trial&lt;/a&gt; or &lt;a href="https://steadybit.com/book-demo/" rel="noopener noreferrer"&gt;book a quick call&lt;/a&gt; with us to discuss how you can implement chaos engineering and start saving money today.&lt;/p&gt;

</description>
      <category>roi</category>
      <category>chaosengineering</category>
      <category>sre</category>
      <category>testing</category>
    </item>
    <item>
      <title>3 Types of Chaos Experiments and How To Run Them</title>
      <dc:creator>Patrick Londa</dc:creator>
      <pubDate>Thu, 24 Apr 2025 17:44:42 +0000</pubDate>
      <link>https://dev.to/steadybit/3-types-of-chaos-experiments-and-how-to-run-them-1p59</link>
      <guid>https://dev.to/steadybit/3-types-of-chaos-experiments-and-how-to-run-them-1p59</guid>
      <description>&lt;p&gt;The primary objective of a Chaos Experiment is to uncover hidden bugs, weaknesses, or non-obvious points of failure in a system that could lead to significant outages, degradation of service, or system failure under unpredictable real-world conditions.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a Chaos Experiment?
&lt;/h1&gt;

&lt;p&gt;A Chaos Experiment is a carefully designed, controlled, and monitored process that systematically introduces disturbances or abnormalities into a system’s operation to observe and understand its response to such conditions.&lt;/p&gt;

&lt;p&gt;It forms the core part of &lt;a href="https://steadybit.com/chaos-engineering/" rel="noopener noreferrer"&gt;‘Chaos Engineering’&lt;/a&gt;, which is predicated on the idea that ‘the best way to understand system behavior is by observing it under stress.’ This means intentionally injecting faults into a system in production or simulated environments to test its reliability and resilience.&lt;/p&gt;

&lt;p&gt;This practice emerged from the understanding that systems, especially distributed systems, are inherently complex and unpredictable due to their numerous interactions and dependencies.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Components of a Chaos Engineering Experiment
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hypothesis Formation.&lt;/strong&gt; At the initial stage, a hypothesis is formed about the system’s steady-state behavior and expected resilience against certain types of disturbances. This hypothesis predicts no significant deviation in the system’s steady state as a result of the experiment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Variable Introduction.&lt;/strong&gt; This involves injecting specific variables or conditions that simulate real-world disturbances (such as network latency, server failures, or resource depletion). These variables are introduced in a controlled manner to avoid unnecessary risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scope and Safety.&lt;/strong&gt; The experiment’s scope is clearly defined to limit its impact, often called the “blast radius.” Safety mechanisms, such as automatic rollback or kill switches, are implemented to halt the experiment if unexpected negative effects are observed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observation and Data Collection.&lt;/strong&gt; Throughout the experiment, system performance and behavior are closely monitored using detailed logging, metrics, and observability tools. This data collection is critical for analyzing the system’s response to the introduced variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analysis and Learning.&lt;/strong&gt; After the experiment, the data is analyzed to determine whether the hypothesis was correct. This analysis extracts insights regarding the system’s vulnerabilities, resilience, and performance under stress.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterative Improvement.&lt;/strong&gt; The findings from each chaos experiment inform adjustments in system design, architecture, or operational practices. These adjustments aim to mitigate identified weaknesses and enhance overall resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Note → The ultimate goal is not to break things randomly but to uncover systemic weaknesses to improve the system’s resilience. By introducing chaos, you can enhance the understanding of your systems, leading to higher availability, reliability, and a better user experience.&lt;/p&gt;
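&lt;p&gt;The components above can be sketched as a minimal experiment loop. Everything here (service names, metrics, thresholds) is illustrative; a real experiment would query observability tooling and inject faults through a platform or agent:&lt;/p&gt;

```python
# Minimal chaos-experiment skeleton; all names and numbers are illustrative.
def measure_error_rate():
    # Stand-in for a real observability query (Prometheus, Datadog, ...).
    return 0.012

def inject_fault():
    print("injecting 100ms latency into the checkout service (simulated)")

def rollback():
    print("fault removed, system restored")

def run_experiment(baseline, tolerance):
    # Hypothesis: the error rate stays within `tolerance` of `baseline` under fault.
    inject_fault()
    try:
        observed = measure_error_rate()   # observation and data collection
    finally:
        rollback()                        # safety mechanism: always undo the fault
    deviation = abs(observed - baseline)
    # A deviation beyond the tolerance band leaves a positive remainder.
    hypothesis_holds = max(deviation - tolerance, 0) == 0
    return observed, hypothesis_holds

observed, ok = run_experiment(baseline=0.010, tolerance=0.005)
print(f"observed error rate {observed:.1%}, hypothesis holds: {ok}")
```

&lt;p&gt;The analysis and learning steps then happen offline: compare the observed deviation against the hypothesis and feed the findings back into system design.&lt;/p&gt;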

&lt;h1&gt;
  
  
  Types of Chaos Experiments
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Dependency Failure Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To assess how microservices behave when one or more of their dependencies fail. In a microservices architecture, services are designed to perform small tasks and often rely on other services to fulfill a request. The failure of these external dependencies can lead to cascading failures across the system, resulting in degraded performance or system outages. Understanding how these failures impact the overall system is crucial for building resilient services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Possible Experiments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network Latency and Packet Loss.&lt;/strong&gt; Simulate increased latency or packet loss to understand its impact on service response times and throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Service Downtime.&lt;/strong&gt; Emulate the unavailability of a critical service to observe the system’s resilience and failure modes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database Connectivity Issues.&lt;/strong&gt; Introduce connection failures or read/write delays to assess the robustness of data access patterns and caching mechanisms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Third-party API Limiting.&lt;/strong&gt; Mimic rate limiting or downtime of third-party APIs to evaluate external dependency management and error handling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Run a Dependency Failure Experiment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Map Out Dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Begin with a comprehensive inventory of all the external services your system interacts with. This includes databases, third-party APIs, cloud services, and internal services if you work in a microservices architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each dependency, document how your system interacts with it. Note the data exchanged, request frequency, and criticality of each interaction to your system’s operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rank these dependencies based on their importance to your system’s core functionalities. This will help you focus your efforts on the most critical dependencies first.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Simulate Failures&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Use service virtualization or proxy tools like Steadybit to simulate various failures for your dependencies. These can range from network latency, dropped connections, and timeouts to complete unavailability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each dependency, configure the types of faults you want to introduce. This could include delays, error rates, or bandwidth restrictions, mimicking real-world issues that could occur.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start with less severe faults (like increased latency) and gradually move to more severe conditions (like complete downtime), observing the system’s behavior at each stage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Test Microservices Isolation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implement Resilience Patterns. Use libraries like &lt;a href="https://github.com/Netflix/Hystrix" rel="noopener noreferrer"&gt;Hystrix&lt;/a&gt;, &lt;a href="https://resilience4j.readme.io/docs/getting-started" rel="noopener noreferrer"&gt;resilience4j&lt;/a&gt;, or &lt;a href="https://spring.io/projects/spring-cloud-circuitbreaker" rel="noopener noreferrer"&gt;Spring Cloud Circuit Breaker&lt;/a&gt; to implement patterns that prevent failures from cascading across services. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bulkheads.&lt;/strong&gt; Isolate parts of the application into “compartments” to prevent failures in one area from overwhelming others.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit Breakers.&lt;/strong&gt; Automatically “cut off” calls to a dependency if it’s detected as down, allowing it to recover without being overwhelmed by constant requests.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Carefully configure thresholds and timeouts for these patterns. This includes setting appropriate parameters for circuit breakers to trip and recover, and defining bulkheads that isolate services effectively.&lt;/p&gt;
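&lt;p&gt;To make the circuit breaker pattern concrete, here is a toy Python version of the idea those libraries implement (a real breaker would also add a recovery timeout and a half-open state, which this sketch omits):&lt;/p&gt;

```python
# Toy circuit breaker illustrating the pattern; production libraries add
# recovery timeouts, half-open probing, and metrics.
class CircuitBreaker:
    def __init__(self, threshold):
        self.threshold = threshold   # consecutive failures before tripping
        self.failures = 0
        self.open = False

    def call(self, fn, fallback):
        if self.open:
            return fallback()        # dependency presumed down: short-circuit
        try:
            result = fn()
            self.failures = 0        # a success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures == self.threshold:
                self.open = True     # trip: stop hammering the failing dependency
            return fallback()

breaker = CircuitBreaker(threshold=2)
def flaky():
    raise RuntimeError("dependency down")

for _ in range(3):
    print(breaker.call(flaky, lambda: "cached response"))
print(f"breaker open: {breaker.open}")
```

&lt;p&gt;After two consecutive failures the breaker trips, and subsequent calls go straight to the fallback without touching the failing dependency.&lt;/p&gt;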

&lt;p&gt;&lt;strong&gt;Monitor Inter-Service Communication&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Utilize monitoring solutions like &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;, or &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; to monitor how services communicate under normal and failure conditions. Service meshes like Istio or Linkerd can provide detailed insights without changing your application code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on metrics like request success rates, latency, throughput, and error rates. These metrics will help you understand the impact of dependency failures on your system’s performance and reliability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Recommendation → Monitoring in real-time allows you to quickly identify and respond to unexpected behaviors, minimizing the impact on your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze Fallback Mechanisms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Evaluate the effectiveness of implemented fallback mechanisms. This includes static responses, cache usage, default values, or switching to a secondary service if the primary is unavailable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assess whether the retry logic is appropriately configured. This includes evaluating the retry intervals, backoff strategies, and the maximum number of attempts to prevent overwhelming a failing service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure that fallback mechanisms enable your system to operate in a degraded mode rather than failing outright. This helps maintain a service level even when dependencies are experiencing issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
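&lt;p&gt;Retry logic with exponential backoff and jitter, as described above, can be sketched in a few lines (the delays, attempt budget, and the flaky dependency below are illustrative):&lt;/p&gt;

```python
import random
import time

# Retry with exponential backoff and jitter; delays and budget are illustrative.
def call_with_retries(fn, max_attempts=4, base_delay=0.1):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise   # retry budget exhausted: surface the failure
            # Back off base_delay, 2x, 4x, ... plus jitter to avoid thundering herds.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.05))

attempts = []
def flaky_dependency():
    attempts.append(1)
    if len(attempts) == 3:
        return "ok"
    raise RuntimeError("dependency timeout")

print(call_with_retries(flaky_dependency, base_delay=0.01))
```

&lt;p&gt;Capping the attempt count and growing the delay between attempts keeps retries from overwhelming a dependency that is already struggling.&lt;/p&gt;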

&lt;h2&gt;
  
  
  2. Resource Manipulation Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To understand how a system behaves when subjected to unusual or extreme resource constraints, such as CPU, memory, disk I/O, and network bandwidth. The aim is to identify potential bottlenecks and ensure that the system can handle unexpected spikes in demand without significantly degrading service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Possible Experiments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU Saturation.&lt;/strong&gt; Increase CPU usage gradually to see how the system prioritizes tasks and whether essential services remain available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Consumption.&lt;/strong&gt; Simulate memory leaks or high memory demands to test the system’s handling of low memory conditions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Disk I/O and Space Exhaustion.&lt;/strong&gt; Increase disk read/write operations or fill up disk space to observe how the system copes with disk I/O bottlenecks and space limitations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Run a Resource Manipulation Experiment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Define Resource Limits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by monitoring your system under normal operating conditions to establish a baseline for CPU, memory, disk I/O, and network bandwidth usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Based on historical data and performance metrics, define the normal operating range for each critical resource. This will help you identify when the system is under stress, or resource usage is abnormally high during the experiment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Check and Verify the Break-Even Point&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Understand your system’s maximum capacity before it requires scaling. This involves testing the system under gradually increasing load to identify the point at which performance starts to degrade and additional resources are needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you’re using auto-scaling (either in the cloud or on-premises), clearly define and verify the rules for adding new instances or allocating resources. This includes setting CPU and memory usage thresholds, as well as other metrics that trigger scaling actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use load testing tools like &lt;a href="https://jmeter.apache.org/" rel="noopener noreferrer"&gt;JMeter&lt;/a&gt;, &lt;a href="https://gatling.io/" rel="noopener noreferrer"&gt;Gatling&lt;/a&gt;, or &lt;a href="https://locust.io/" rel="noopener noreferrer"&gt;Locust&lt;/a&gt; to simulate demand spikes and verify that your auto-scaling rules work as expected. This will ensure that your system can handle real-world traffic patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Select Manipulation Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While stress and stress-ng are powerful tools for generating CPU, memory, and I/O load on Linux systems, they might not be easy to use across distributed or containerized environments. Tools like &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; offer more user-friendly interfaces for various environments, including microservices and cloud-native applications.&lt;/p&gt;

&lt;p&gt;💡 Pro Tip → Ensure that the tool you select can accurately simulate the types of resource manipulation you’re interested in, whether it’s exhausting CPU cycles, filling up memory, saturating disk I/O, or hogging network bandwidth.&lt;/p&gt;
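&lt;p&gt;For a feel of what a CPU stressor does under the hood, here is a deliberately tiny single-core stand-in (real tools like stress-ng coordinate many workers and cover memory, disk, and network as well):&lt;/p&gt;

```python
import time

# Deliberately tiny CPU stressor: a bounded busy loop on one core.
# Bounding by iteration count (not wall clock) guarantees it terminates.
def burn_cpu(iterations):
    x = 1.0
    for _ in range(iterations):
        x = x * 1.000001 % 10.0   # pointless arithmetic keeps the core busy
    return x

start = time.monotonic()
burn_cpu(2_000_000)
print(f"busy-looped for {time.monotonic() - start:.2f}s on one core")
```

&lt;p&gt;Scaling this idea up means running one such worker per core you want to saturate, with a rollback (simply stopping the workers) always available.&lt;/p&gt;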

&lt;p&gt;&lt;strong&gt;Apply Changes Gradually&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by applying small changes to resource consumption and monitor the system’s response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitor system performance carefully to identify the thresholds at which performance degrades or fails. This will help you understand the system’s resilience and where improvements are needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monitor System Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use comprehensive monitoring solutions to track the impact of resource manipulation on system performance. Look for changes in response times, throughput, error rates, and system resource utilization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Pro Tip → Platforms like &lt;a href="https://steadybit.com/" rel="noopener noreferrer"&gt;Steadybit&lt;/a&gt; can integrate with monitoring tools to provide a unified view of how resource constraints affect system health, making it easier to correlate actions with outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate Resilience&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analyze how effectively your system scales up resources in response to the induced stress. This includes evaluating the timeliness of scaling actions and whether the added resources alleviate the performance issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluate the efficiency of your resource allocation algorithms. This involves assessing whether resources are being utilized optimally and whether unnecessary wastage or contention exists.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test the robustness of your failover and redundancy mechanisms under conditions of resource scarcity. This can include switching to standby systems, redistributing load among available resources, or degrading service gracefully.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Network Disruption Experiment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; To simulate various network conditions that can affect a system’s operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, the experiment seeks to understand how a system responds and adapts to network unreliability, ensuring critical applications can withstand and recover from real-world network issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Possible Experiments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DNS Failures.&lt;/strong&gt; Introduce DNS resolution issues to evaluate the system’s reliance on DNS and its ability to use fallback DNS services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency Injection.&lt;/strong&gt; Introduce artificial delay in the network to simulate high-latency conditions, affecting the communication between services or components.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Packet Loss Simulation.&lt;/strong&gt; Simulate the loss of data packets in the network to test how well the system handles data transmission errors and retries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bandwidth Throttling.&lt;/strong&gt; Limit the network bandwidth available to the application, simulating congestion conditions or degraded network services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Drops.&lt;/strong&gt; Force abrupt disconnections or intermittent connectivity to test session persistence and reconnection strategies.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Run a Network Disruption Experiment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Identify Network Paths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start by mapping out your network’s topology, including routers, switches, gateways, and the connections between different segments. Tools like &lt;a href="https://nmap.org/" rel="noopener noreferrer"&gt;Nmap&lt;/a&gt; or network diagram software can help visualize your network’s structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on identifying the critical paths data takes when traveling through your system. These include paths between microservices, external APIs, databases, and the Internet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Document these paths and prioritize them based on their importance to your system’s operation. This will help you decide where to start with your network disruption experiments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Disruption Type&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decide on the type of network disruption to simulate. Options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;complete network outages&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;latency (delays in data transmission)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;packet loss (data packets being lost during transmission)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;bandwidth limitations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, choose disruptions based on their likelihood and potential impact on your system.&lt;br&gt;
For example, simulating latency and packet loss might be particularly relevant if your system is distributed across multiple geographic locations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Network Chaos Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic Control (TC).&lt;/strong&gt; The ‘tc’ command in Linux is a powerful tool for controlling network traffic. It allows you to introduce delays, packet loss, and bandwidth restrictions on your network interfaces.&lt;/p&gt;
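&lt;p&gt;For example, the netem commands for a disruption and its rollback can be generated and reviewed before running them (the interface name and fault spec below are placeholders; applying them requires root on the target host):&lt;/p&gt;

```python
import shlex
import subprocess

# Build the `tc`/netem commands for a disruption and its rollback.
# `eth0` and the fault spec are placeholders; running these requires root.
def netem_cmds(interface, spec):
    return [
        f"tc qdisc add dev {interface} root netem {spec}",   # apply the disruption
        f"tc qdisc del dev {interface} root netem",          # rollback when done
    ]

apply_cmd, rollback_cmd = netem_cmds("eth0", "delay 100ms 20ms loss 1%")
print(apply_cmd)
print(rollback_cmd)
# To actually execute (as root): subprocess.run(shlex.split(apply_cmd), check=True)
```

&lt;p&gt;Generating both the apply and rollback commands up front mirrors the safety practice above: the path back to normal conditions exists before the disruption starts.&lt;/p&gt;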

&lt;p&gt;⚠️ Note → Simulating DNS failures can be complex but is crucial for understanding how your system reacts to DNS resolution issues. Consider using specialized tools or features for this purpose.&lt;/p&gt;

&lt;p&gt;In contrast, chaos experiment solutions like Steadybit provide user-friendly interfaces for simulating network disruptions, along with safety features like built-in rollback strategies that minimize the risk of long-term impact on your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor Connectivity and Throughput&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;During the experiment, use network monitoring tools and observability platforms to track connectivity and throughput metrics in real time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on monitoring packet loss rates, latency, bandwidth usage, and error rates to assess the impact of the network disruptions you’re simulating.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Assess Failover and Recovery&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Evaluate how well your system’s failover mechanisms respond to network disruptions. For example, you could switch to a redundant network path, use a different DNS server, or take other predefined recovery actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Measure the time it takes for the system to detect and recover from the issue. This includes the time it takes to fail over and to return to normal operations after the disruption ends.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Recommended → Analyze the overall resilience of your system to network instability. This assessment should include how well services degrade (if at all) and how quickly and effectively they recover once normal conditions are restored.&lt;/p&gt;

&lt;p&gt;If you want to read more, check out the &lt;a href="https://steadybit.com/blog/chaos-experiments/" rel="noopener noreferrer"&gt;rest of the post&lt;/a&gt;, which covers the principles of chaos engineering and popular tools.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>performance</category>
      <category>chaosengineering</category>
    </item>
  </channel>
</rss>
